WOE Encoding

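The following code is for demonstration only. For system security reasons, do not use it directly in production.

We recommend running this tutorial in Jupyter.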

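Binning groups the values of an independent variable into buckets based on a ranking method. It lets us convert continuous variables into categorical ones.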
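
As a small illustration of turning a continuous variable into a categorical one, the sketch below uses equal-frequency (quantile) binning from pandas. This is plain pandas, not SecretFlow's binning implementation, and the synthetic series `x` and the choice of four buckets are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic continuous feature (an assumption for this example).
rng = np.random.default_rng(0)
x = pd.Series(rng.uniform(-1, 1, size=1000), name="x")

# Equal-frequency binning: values are ranked and split into 4 buckets
# that each hold roughly the same number of rows.
bins = pd.qcut(x, q=4)

# Each row now carries an interval label instead of a raw number.
print(bins.value_counts().sort_index())
```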

WOE binning bins numeric variables with respect to a binary target variable.

bin_total = bin_positives + bin_negatives
total_labels = total_positives + total_negatives
bin_woe = log((bin_positives / total_positives) / (bin_negatives / total_negatives))
bin_iv = ((bin_positives / total_positives) - (bin_negatives / total_negatives)) * bin_woe
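
The formulas above can be sketched in NumPy as follows. The helper name `woe_iv` and the per-bin counts are hypothetical, and the sketch assumes every bin contains at least one positive and one negative sample, since the logarithm is undefined otherwise. The feature-level IV is taken as the sum of the bin IVs.

```python
import numpy as np


def woe_iv(bin_positives, bin_negatives):
    """Return per-bin WOE, per-bin IV, and the feature IV (sum of bin IVs)."""
    bin_positives = np.asarray(bin_positives, dtype=float)
    bin_negatives = np.asarray(bin_negatives, dtype=float)
    total_positives = bin_positives.sum()
    total_negatives = bin_negatives.sum()

    # Share of all positives / negatives that falls into each bin.
    pos_rate = bin_positives / total_positives
    neg_rate = bin_negatives / total_negatives

    bin_woe = np.log(pos_rate / neg_rate)
    bin_iv = (pos_rate - neg_rate) * bin_woe
    return bin_woe, bin_iv, bin_iv.sum()


# Hypothetical counts for three bins (not taken from the dataset used below).
bin_woe, bin_iv, feature_iv = woe_iv([10, 30, 60], [40, 40, 20])
print(bin_woe, bin_iv, feature_iv)
```

Each bin IV is non-negative because the difference of the two rates and the logarithm of their ratio always share the same sign.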

Currently, WOE encoding is available for vertically partitioned datasets.

Let's start by loading a sample dataset.

[1]:
import pandas as pd
import secretflow as sf
from secretflow.data.vertical import VDataFrame
from secretflow.utils.simulation.datasets import load_linear
[2]:
sf.shutdown()
sf.init(['alice', 'bob'], address='local')
alice, bob = sf.PYU('alice'), sf.PYU('bob')
# WOE binning with an HEU device is set up similarly
spu = sf.SPU(sf.utils.testing.cluster_def(['alice', 'bob']))
[3]:
parts = {
    bob: (1, 11),
    alice: (11, 22),
}
vdf = load_linear(parts=parts)
[4]:
label_data = vdf['y']
y = sf.reveal(label_data.partitions[alice].data).values

We are now ready to perform WOE binning and substitution.

[5]:
from secretflow.preprocessing.binning.vert_woe_binning import VertWoeBinning
from secretflow.preprocessing.binning.vert_bin_substitution import VertBinSubstitution

binning = VertWoeBinning(spu)
bin_rules = binning.binning(
    vdf,
    binning_method="chimerge",
    bin_num=4,
    bin_names={alice: ['x14'], bob: ["x5", "x7"]},
    label_name="y",
)

woe_sub = VertBinSubstitution()
vdf = woe_sub.substitution(vdf, bin_rules)

# this is for demo only, be careful with reveal
print(sf.reveal(vdf.partitions[alice].data))
print(sf.reveal(vdf.partitions[bob].data))
           x11       x12       x13       x14       x15       x16       x17  \
0     0.241531 -0.705729 -0.020094 -0.493792  0.851992  0.035219 -0.796096
1    -0.402727  0.115744  0.468149 -0.735388  0.386395  0.712798  0.239583
2     0.872675 -0.559321  0.390246 -0.079604  0.225594 -0.639674  0.279511
3    -0.644718 -0.409382  0.141747 -0.682014  0.314084 -0.802476  0.348878
4    -0.949669 -0.940787 -0.951708  0.232566  0.272346  0.124419  0.853226
...        ...       ...       ...       ...       ...       ...       ...
9995 -0.031331 -0.078700 -0.020636 -0.493792  0.210120 -0.288943 -0.262945
9996  0.047039  0.965614 -0.921435 -0.251231  0.205778  0.155392  0.922683
9997  0.269438 -0.115586  0.928880  0.513349  0.269042 -0.331772  0.520971
9998  0.999325  0.433372 -0.805999  0.232566  0.072405  0.973399 -0.123470
9999 -0.203443  0.772931 -0.146181 -0.305728  0.274590  0.803816 -0.312047

           x18       x19       x20  y
0     0.810261  0.048303  0.937679  1
1     0.312728  0.526637  0.589773  1
2     0.039087 -0.753417  0.516735  0
3    -0.855979  0.250944  0.979465  1
4    -0.238805  0.243109 -0.121446  1
...        ...       ...       ... ..
9995 -0.847253  0.069960  0.786748  1
9996 -0.502486 -0.076290 -0.604832  1
9997 -0.424209  0.434947  0.998955  1
9998  0.914291 -0.473056  0.616257  1
9999 -0.602927 -0.021368  0.885519  0

[10000 rows x 11 columns]
            x1        x2        x3        x4        x5        x6        x7  \
0    -0.514226  0.730010 -0.730391  0.970483  0.038063 -0.800808 -0.082006
1    -0.725537  0.482244 -0.823223  0.202119  0.484290 -0.139781 -0.341216
2     0.608353 -0.071102 -0.775098 -0.391496  0.224127  0.082370 -0.341216
3    -0.686642  0.160470  0.914477 -0.269052  0.224127 -0.547841 -0.341216
4    -0.198111  0.212909  0.950474  0.775259 -0.590206 -0.840528 -0.742056
...        ...       ...       ...       ...       ...       ...       ...
9995 -0.367246 -0.296454  0.558596 -0.403504  0.542892  0.000142 -0.341216
9996  0.010913  0.629268 -0.384093 -0.552787  0.542892 -0.100838  0.071673
9997 -0.238097  0.904069 -0.344859 -0.687887 -0.103900  0.223052 -0.286633
9998  0.453686 -0.375173  0.899238  0.908135 -0.590206  0.524051  0.347251
9999 -0.776015 -0.772112  0.012110 -0.898067  0.182405 -0.500491  0.557853

            x8        x9       x10
0    -0.499206 -0.750112 -0.910640
1    -0.652901  0.438065  0.830206
2    -0.183506 -0.783842 -0.729929
3    -0.269405 -0.974268 -0.800515
4     0.800389  0.185542  0.183614
...        ...       ...       ...
9995 -0.470127 -0.247682 -0.552526
9996  0.592903 -0.577123 -0.811461
9997 -0.172245  0.713149 -0.184585
9998 -0.558997  0.610076 -0.862191
9999 -0.275658 -0.250420  0.518420

[10000 rows x 10 columns]

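Sometimes we also need the IV values. According to GitHub issue secretflow/secretflow#565, publishing the per-bin IVs may leak label information. For now, the bin IVs are kept on the label holder's device. The label holder can choose to

  1. Share no IV information

  2. Share selected IV information

We will demonstrate how to share feature IVs.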

Recall that bin_rules is a dict {PYU: PYUObject}, where each PYUObject is itself a dict of the following form:

{
    "variables":[
        {
            "name": str, # feature name
            "type": str, # "string" for a discrete feature, "numeric" for a continuous one
            "categories": list[str], # categories for a discrete feature
            "split_points": list[float], # split points of left-open, right-closed intervals
            "total_counts": list[int], # total sample count in each bin
            "else_counts": int, # number of np.nan samples
            "filling_values": list[float], # WOE value for each bin
            "else_filling_value": float, # WOE value for np.nan samples
        },
        # ... other features
    ],
    # present in the label holder's PYUObject only
    # warning: giving bin IVs to another party leaks the number of positive samples in each bin.
    # it is up to the label holder whether to give feature IVs, bin IVs, or all info to workers.
    # for more information, see: https://github.com/secretflow/secretflow/issues/565

    # in the following comments, "safe" means that label distribution info is not leaked.
    "feature_iv_info" :[
        {
            "name": str, # feature name
            "ivs": list[float], # IV value for each bin, not safe to share with workers in any case
            "else_iv": float, # IV for nan values, may be shared with workers
            "feature_iv": float, # sum of the bin IVs, safe to share with workers when bin num > 2
        }
    ]
}
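
To make the rule layout more concrete, here is a minimal sketch of how a numeric rule of this shape could be applied to raw values. It is not the code that VertBinSubstitution runs; the rule contents, the helper `substitute_numeric`, and the assumption that `filling_values` holds one more entry than `split_points` are all illustrative.

```python
import numpy as np

# Hypothetical rule for one numeric feature, shaped like an entry of "variables".
rule = {
    "name": "x_demo",
    "type": "numeric",
    "split_points": [-0.5, 0.0, 0.5],
    "filling_values": [-0.8, -0.2, 0.1, 0.6],  # assumed: one WOE value per bin
    "else_filling_value": 0.0,  # WOE value used for np.nan samples
}


def substitute_numeric(values, rule):
    """Replace each raw value with the WOE value of the bin it falls into."""
    values = np.asarray(values, dtype=float)
    # side="left" sends a value equal to a split point to the bin on its left,
    # which matches left-open, right-closed intervals.
    bin_idx = np.searchsorted(rule["split_points"], values, side="left")
    woe = np.asarray(rule["filling_values"], dtype=float)[bin_idx]
    # Missing values get the dedicated filling value instead.
    return np.where(np.isnan(values), rule["else_filling_value"], woe)


result = substitute_numeric([-0.9, -0.5, 0.2, 0.7, np.nan], rule)
print(result)
```

The value -0.5 sits exactly on a split point and lands in the first bin, which is the behaviour a right-closed interval implies.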
[6]:
# alice is label holder
dict_pyu_object = bin_rules[alice]


def extract_name_and_feature_iv(list_of_feature_iv_info):
    return [(d["name"], d["feature_iv"]) for d in list_of_feature_iv_info]


feature_ivs = alice(
    lambda dict_pyu_object: extract_name_and_feature_iv(
        dict_pyu_object["feature_iv_info"]
    )
)(dict_pyu_object)
[7]:
# we can give the feature_ivs to bob
feature_ivs.to(bob)
# and/or we can reveal it to see it
sf.reveal(feature_ivs)
[7]:
[('x14', 0.43219177635839423), ('x5', 0.37848298069087766), ('x7', 0)]
[8]:
feature_iv_info = sf.reveal(feature_ivs)
df = pd.DataFrame.from_records(feature_iv_info, columns=['feature', 'iv'])

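How should a feature IV be interpreted?

  • Less than 0.02 -> not useful for prediction

  • 0.02 to 0.1 -> weak predictive power

  • 0.1 to 0.3 -> medium predictive power

  • 0.3 to 0.5 -> strong predictive power

  • Greater than 0.5 -> suspicious predictive power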
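
The rule of thumb above can be written as a small helper. The function name `interpret_iv` is hypothetical, and because the list does not say which side of each boundary is inclusive, the handling of values exactly equal to 0.02, 0.1, 0.3, or 0.5 is an assumption.

```python
def interpret_iv(iv):
    """Map a feature IV to the rule-of-thumb category listed above."""
    if iv < 0.02:
        return "not useful"
    if iv < 0.1:
        return "weak"
    if iv < 0.3:
        return "medium"
    if iv <= 0.5:  # assumed: 0.5 itself still counts as strong
        return "strong"
    return "suspicious"


# Feature IVs as revealed earlier in this tutorial.
feature_iv_info = [
    ('x14', 0.43219177635839423),
    ('x5', 0.37848298069087766),
    ('x7', 0),
]
for name, iv in feature_iv_info:
    print(name, interpret_iv(iv))
```

With these values, x14 and x5 are labelled strong and x7 is labelled not useful.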

Let's select the two features with the highest IV.

[9]:
print(df.sort_values('iv', ascending=False).head(2))
  feature        iv
0     x14  0.432192
1      x5  0.378483

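Congratulations! In this tutorial, we learned how to

  1. Perform WOE encoding

  2. Share selected IV information with other parties

  3. Use feature IVs for feature selection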