WOE编码#

以下代码仅用于演示。由于系统安全问题,请勿直接在生产中使用。

建议使用 jupyter 运行本教程。

分箱是基于排序方法创建独立变量的桶。分箱帮助我们将连续变量转换为分类变量。

WOE分箱实现了针对二元目标变量的数值变量的分箱。

bin_total = bin_positives + bin_negatives
total_labels = total_positives + total_negatives
bin_WOE = log((bin_positives / total_positives) / (bin_negatives / total_negatives))
bin_iv = ((bin_positives / total_positives) - (bin_negatives / total_negatives)) * bin_woe

目前我们为垂直分割的数据集提供WOE编码。

让我们先加载一个样本数据集。

[1]:
import secretflow as sf
from secretflow.data.vertical import VDataFrame
from secretflow.utils.simulation.datasets import load_linear
[2]:
sf.shutdown()
sf.init(['alice', 'bob'], address='local')
alice, bob = sf.PYU('alice'), sf.PYU('bob')
spu = sf.SPU(sf.utils.testing.cluster_def(['alice', 'bob']))
2023-10-07 18:03:23,909 INFO worker.py:1538 -- Started a local Ray instance.
[3]:
parts = {
    bob: (1, 11),
    alice: (11, 22),
}
vdf = load_linear(parts=parts)
[4]:
label_data = vdf['y']
y = sf.reveal(label_data.partitions[alice].data).values

现在,我们已经准备好执行WOE分箱和替换。

[5]:
from secretflow.preprocessing.binning.vert_woe_binning import VertWoeBinning
from secretflow.preprocessing.binning.vert_bin_substitution import VertBinSubstitution

binning = VertWoeBinning(spu)
bin_rules = binning.binning(
    vdf,
    binning_method="chimerge",
    bin_num=4,
    bin_names={alice: [], bob: ["x5", "x7"]},
    label_name="y",
)

woe_sub = VertBinSubstitution()
vdf = woe_sub.substitution(vdf, bin_rules)

# this is for demo only, be careful with reveal
print(sf.reveal(vdf.partitions[alice].data))
print(sf.reveal(vdf.partitions[bob].data))
INFO:root:Create proxy actor <class 'secretflow.preprocessing.binning.vert_woe_binning_pyu.VertWoeBinningPyuWorker'> with party alice.
INFO:root:Create proxy actor <class 'secretflow.preprocessing.binning.vert_woe_binning_pyu.VertWoeBinningPyuWorker'> with party bob.
(SPURuntime pid=3800442) 2023-10-07 18:03:26.686 [info] [default_brpc_retry_policy.cc:DoRetry:52] socket error, sleep=1000000us and retry
(_run pid=3793086) INFO:jax._src.xla_bridge:Unable to initialize backend 'cuda': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
(_run pid=3793086) INFO:jax._src.xla_bridge:Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
(_run pid=3793086) INFO:jax._src.xla_bridge:Unable to initialize backend 'tpu': INVALID_ARGUMENT: TpuPlatform is not available.
(_run pid=3793086) INFO:jax._src.xla_bridge:Unable to initialize backend 'plugin': xla_extension has no attributes named get_plugin_device_client. Compile TensorFlow with //tensorflow/compiler/xla/python:enable_plugin_device set to true (defaults to false) to enable this.
(_run pid=3793086) WARNING:jax._src.xla_bridge:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
(_run pid=3793086) [2023-10-07 18:03:27.436] [info] [thread_pool.cc:30] Create a fixed thread pool with size 63
(SPURuntime pid=3800442) 2023-10-07 18:03:27.686 [info] [default_brpc_retry_policy.cc:LogHttpDetail:29] cntl ErrorCode '112', http status code '200', response header '', error msg '[E111]Fail to connect Socket{id=0 addr=127.0.0.1:34547} (0x0x32463c0): Connection refused [R1][E112]Not connected to 127.0.0.1:34547 yet, server_id=0'
(SPURuntime pid=3800442) 2023-10-07 18:03:27.686 [info] [default_brpc_retry_policy.cc:DoRetry:75] aggressive retry, sleep=1000000us and retry
(SPURuntime pid=3800442) 2023-10-07 18:03:28.686 [info] [default_brpc_retry_policy.cc:LogHttpDetail:29] cntl ErrorCode '112', http status code '200', response header '', error msg '[E111]Fail to connect Socket{id=0 addr=127.0.0.1:34547} (0x0x32463c0): Connection refused [R1][E112]Not connected to 127.0.0.1:34547 yet, server_id=0 [R2][E112]Not connected to 127.0.0.1:34547 yet, server_id=0'
(SPURuntime pid=3800442) 2023-10-07 18:03:28.686 [info] [default_brpc_retry_policy.cc:DoRetry:75] aggressive retry, sleep=1000000us and retry
(SPURuntime pid=3800441) 2023-10-07 18:03:28.687 [info] [default_brpc_retry_policy.cc:DoRetry:69] not retry for reached rcp timeout, ErrorCode '1008', error msg '[E1008]Reached timeout=2000ms @127.0.0.1:57517'
INFO:root:Create proxy actor <class 'secretflow.preprocessing.binning.vert_bin_substitution.VertBinSubstitutionPyuWorker'> with party bob.
INFO:root:Create proxy actor <class 'secretflow.preprocessing.binning.vert_bin_substitution.VertBinSubstitutionPyuWorker'> with party alice.
(SPURuntime(device_id=None, party=bob) pid=3800442) 2023-10-07 18:03:30.776 [info] [thread_pool.cc:ThreadPool:30] Create a fixed thread pool with size 63
(SPURuntime(device_id=None, party=alice) pid=3800441) 2023-10-07 18:03:30.794 [info] [thread_pool.cc:ThreadPool:30] Create a fixed thread pool with size 63
           x11       x12       x13       x14       x15       x16       x17  \
0     0.241531 -0.705729 -0.020094 -0.486932  0.851992  0.035219 -0.796096
1    -0.402727  0.115744  0.468149 -0.697152  0.386395  0.712798  0.239583
2     0.872675 -0.559321  0.390246  0.000472  0.225594 -0.639674  0.279511
3    -0.644718 -0.409382  0.141747 -0.797517  0.314084 -0.802476  0.348878
4    -0.949669 -0.940787 -0.951708  0.187475  0.272346  0.124419  0.853226
...        ...       ...       ...       ...       ...       ...       ...
9995 -0.031331 -0.078700 -0.020636 -0.575713  0.210120 -0.288943 -0.262945
9996  0.047039  0.965614 -0.921435 -0.092970  0.205778  0.155392  0.922683
9997  0.269438 -0.115586  0.928880  0.430016  0.269042 -0.331772  0.520971
9998  0.999325  0.433372 -0.805999  0.311548  0.072405  0.973399 -0.123470
9999 -0.203443  0.772931 -0.146181 -0.195646  0.274590  0.803816 -0.312047

           x18       x19       x20  y
0     0.810261  0.048303  0.937679  1
1     0.312728  0.526637  0.589773  1
2     0.039087 -0.753417  0.516735  0
3    -0.855979  0.250944  0.979465  1
4    -0.238805  0.243109 -0.121446  1
...        ...       ...       ... ..
9995 -0.847253  0.069960  0.786748  1
9996 -0.502486 -0.076290 -0.604832  1
9997 -0.424209  0.434947  0.998955  1
9998  0.914291 -0.473056  0.616257  1
9999 -0.602927 -0.021368  0.885519  0

[10000 rows x 11 columns]
            x1        x2        x3        x4        x5        x6        x7  \
0    -0.514226  0.730010 -0.730391  0.970483  0.038063 -0.800808 -0.082006
1    -0.725537  0.482244 -0.823223  0.202119  0.484290 -0.139781 -0.341216
2     0.608353 -0.071102 -0.775098 -0.391496  0.224127  0.082370 -0.341216
3    -0.686642  0.160470  0.914477 -0.269052  0.224127 -0.547841 -0.341216
4    -0.198111  0.212909  0.950474  0.775259 -0.590206 -0.840528 -0.742056
...        ...       ...       ...       ...       ...       ...       ...
9995 -0.367246 -0.296454  0.558596 -0.403504  0.542892  0.000142 -0.341216
9996  0.010913  0.629268 -0.384093 -0.552787  0.542892 -0.100838  0.071673
9997 -0.238097  0.904069 -0.344859 -0.687887 -0.103900  0.223052 -0.286633
9998  0.453686 -0.375173  0.899238  0.908135 -0.590206  0.524051  0.347251
9999 -0.776015 -0.772112  0.012110 -0.898067  0.182405 -0.500491  0.557853

            x8        x9       x10
0    -0.499206 -0.750112 -0.910640
1    -0.652901  0.438065  0.830206
2    -0.183506 -0.783842 -0.729929
3    -0.269405 -0.974268 -0.800515
4     0.800389  0.185542  0.183614
...        ...       ...       ...
9995 -0.470127 -0.247682 -0.552526
9996  0.592903 -0.577123 -0.811461
9997 -0.172245  0.713149 -0.184585
9998 -0.558997  0.610076 -0.862191
9999 -0.275658 -0.250420  0.518420

[10000 rows x 10 columns]

Sometimes we may need the iv values. Releasing bin ivs will potentially leak label information according to issue secretflow/secretflow#565. Currently, we choose to save bin iv values in label holders device. It is up to label holder’s choice to

  1. share no iv information

  2. share some chosen iv information

我们将演示如何共享特征iv。

Recall that the bin_rules is a dictionary {PYU: PYUObject}, where each PYUObject itself is a dictionary of the following type:

{
    "variables":[
        {
            "name": str, # feature name
            "type": str, # "string" or "numeric", if feature is discrete or continuous
            "categories": list[str], # categories for discrete feature
            "split_points": list[float], # left-open right-close split points
            "total_counts": list[int], # total samples count in each bins.
            "else_counts": int, # np.nan samples count
            "filling_values": list[float], # woe values for each bins.
            "else_filling_value": float, # woe value for np.nan samples.
        },
        # ... others feature
    ],
    # label holder's PYUObject only
    # warning: giving bin_ivs to other party will leak positive samples in each bin.
    # it is up to label holder's will to give feature iv or bin ivs or all info to workers.
    # for more information, look at: https://github.com/secretflow/secretflow/issues/565

    # in the following comment, by safe we mean label distribution info is not leaked.
    "feature_iv_info" :[
        {
            "name": str, #feature name
            "ivs": list[float], #iv values for each bins, not safe to share with workers in any case.
            "else_iv": float, #iv for nan values, may share to with workers
            "feature_iv": float, #sum of bin_ivs, safe to share with workers when bin num > 2.
        }
    ]
}
[6]:
# alice is label holder
dict_pyu_object = bin_rules[alice]


def extract_name_and_feature_iv(list_of_feature_iv_info):
    return [(d["name"], d["feature_iv"]) for d in list_of_feature_iv_info]


feature_ivs = alice(
    lambda dict_pyu_object: extract_name_and_feature_iv(
        dict_pyu_object["feature_iv_info"]
    )
)(dict_pyu_object)
[7]:
# we can give the feature_ivs to bob
feature_ivs.to(bob)
# and/or we can reveal it to see it
sf.reveal(feature_ivs)
[7]:
[('x5', 0.37848298069087766), ('x7', 0)]

恭喜! 在本教程中,我们学习了如何

  1. 进行WOE编码

  2. 将一些iv信息分享给其他方