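WOE Encoding
The following code is for demonstration only. Due to system security concerns, do not use it directly in production.
We recommend running this tutorial in Jupyter.
Binning groups the values of an independent variable into buckets based on their rank order. It lets us convert a continuous variable into a categorical one.
WOE binning bins numeric variables against a binary target variable. For each bin, the WOE and IV are defined as follows: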
bin_total = bin_positives + bin_negatives
total_labels = total_positives + total_negatives
bin_woe = log((bin_positives / total_positives) / (bin_negatives / total_negatives))
bin_iv = ((bin_positives / total_positives) - (bin_negatives / total_negatives)) * bin_woe
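The formulas above can be checked on a small example. The sketch below uses made-up toy data and plain pandas equal-frequency binning (`pd.qcut`); this is an assumption chosen for illustration, not the binning method used by SecretFlow. A bin with zero positives or zero negatives would make the logarithm undefined; this sketch does not handle that case.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration): one numeric feature and a binary label.
x = pd.Series([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8])
y = pd.Series([0, 0, 0, 1, 0, 1, 1, 1])

# Two equal-frequency bins; any other binning method could be substituted here.
bins = pd.qcut(x, q=2, labels=False)

total_positives = (y == 1).sum()
total_negatives = (y == 0).sum()

# Per-bin counts of positive and negative labels.
stats = pd.DataFrame(
    {
        "bin_positives": y.groupby(bins).sum(),
        "bin_negatives": (1 - y).groupby(bins).sum(),
    }
)

# Apply the WOE and IV formulas bin by bin.
pos_rate = stats["bin_positives"] / total_positives
neg_rate = stats["bin_negatives"] / total_negatives
stats["bin_woe"] = np.log(pos_rate / neg_rate)
stats["bin_iv"] = (pos_rate - neg_rate) * stats["bin_woe"]

# The feature IV is the sum of the bin IVs.
feature_iv = stats["bin_iv"].sum()

print(stats)
print("feature_iv:", feature_iv)
```

In this toy data the lower bin holds 1 positive and 3 negatives and the upper bin holds 3 positives and 1 negative, so the two WOE values are -ln 3 and ln 3, and the feature IV is ln 3, about 1.0986.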
Currently, we provide WOE encoding for vertically partitioned datasets.
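As a rough picture of what "vertically partitioned" means, the sketch below splits one table by columns so that each party holds different features for the same rows. It uses plain pandas with made-up column names and values; it only illustrates the layout and does not use SecretFlow's `VDataFrame`.

```python
import pandas as pd

# Made-up full table; in a real deployment no single party would hold all of it.
full = pd.DataFrame(
    {
        "x1": [0.1, 0.4, 0.7],
        "x2": [1.0, 2.0, 3.0],
        "y": [0, 1, 1],
    }
)

# Hypothetical split: bob holds x1, alice holds x2 and the label y.
bob_part = full[["x1"]]
alice_part = full[["x2", "y"]]

# Same rows, disjoint columns.
print(bob_part.columns.tolist(), alice_part.columns.tolist())
print(bob_part.index.equals(alice_part.index))
```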
Let us first load a sample dataset.
[1]:
import pandas as pd
import secretflow as sf
from secretflow.data.vertical import VDataFrame
from secretflow.utils.simulation.datasets import load_linear
[2]:
sf.shutdown()
sf.init(['alice', 'bob'], address='local')
alice, bob = sf.PYU('alice'), sf.PYU('bob')
# an SPU is used here; WOE binning with an HEU works similarly
spu = sf.SPU(sf.utils.testing.cluster_def(['alice', 'bob']))
2023-10-07 18:03:18,152 INFO worker.py:1538 -- Started a local Ray instance.
[3]:
parts = {
    bob: (1, 11),
    alice: (11, 22),
}
vdf = load_linear(parts=parts)
[4]:
label_data = vdf['y']
y = sf.reveal(label_data.partitions[alice].data).values
Now we are ready to perform WOE binning and substitution.
[5]:
from secretflow.preprocessing.binning.vert_woe_binning import VertWoeBinning
from secretflow.preprocessing.binning.vert_bin_substitution import VertBinSubstitution
binning = VertWoeBinning(spu)
bin_rules = binning.binning(
    vdf,
    binning_method="chimerge",
    bin_num=4,
    bin_names={alice: ["x14"], bob: ["x5", "x7"]},
    label_name="y",
)
woe_sub = VertBinSubstitution()
vdf = woe_sub.substitution(vdf, bin_rules)
# this is for demo only, be careful with reveal
print(sf.reveal(vdf.partitions[alice].data))
print(sf.reveal(vdf.partitions[bob].data))
x11 x12 x13 x14 x15 x16 x17 \
0 0.241531 -0.705729 -0.020094 -0.493792 0.851992 0.035219 -0.796096
1 -0.402727 0.115744 0.468149 -0.735388 0.386395 0.712798 0.239583
2 0.872675 -0.559321 0.390246 -0.079604 0.225594 -0.639674 0.279511
3 -0.644718 -0.409382 0.141747 -0.682014 0.314084 -0.802476 0.348878
4 -0.949669 -0.940787 -0.951708 0.232566 0.272346 0.124419 0.853226
... ... ... ... ... ... ... ...
9995 -0.031331 -0.078700 -0.020636 -0.493792 0.210120 -0.288943 -0.262945
9996 0.047039 0.965614 -0.921435 -0.251231 0.205778 0.155392 0.922683
9997 0.269438 -0.115586 0.928880 0.513349 0.269042 -0.331772 0.520971
9998 0.999325 0.433372 -0.805999 0.232566 0.072405 0.973399 -0.123470
9999 -0.203443 0.772931 -0.146181 -0.305728 0.274590 0.803816 -0.312047
x18 x19 x20 y
0 0.810261 0.048303 0.937679 1
1 0.312728 0.526637 0.589773 1
2 0.039087 -0.753417 0.516735 0
3 -0.855979 0.250944 0.979465 1
4 -0.238805 0.243109 -0.121446 1
... ... ... ... ..
9995 -0.847253 0.069960 0.786748 1
9996 -0.502486 -0.076290 -0.604832 1
9997 -0.424209 0.434947 0.998955 1
9998 0.914291 -0.473056 0.616257 1
9999 -0.602927 -0.021368 0.885519 0
[10000 rows x 11 columns]
x1 x2 x3 x4 x5 x6 x7 \
0 -0.514226 0.730010 -0.730391 0.970483 0.038063 -0.800808 -0.082006
1 -0.725537 0.482244 -0.823223 0.202119 0.484290 -0.139781 -0.341216
2 0.608353 -0.071102 -0.775098 -0.391496 0.224127 0.082370 -0.341216
3 -0.686642 0.160470 0.914477 -0.269052 0.224127 -0.547841 -0.341216
4 -0.198111 0.212909 0.950474 0.775259 -0.590206 -0.840528 -0.742056
... ... ... ... ... ... ... ...
9995 -0.367246 -0.296454 0.558596 -0.403504 0.542892 0.000142 -0.341216
9996 0.010913 0.629268 -0.384093 -0.552787 0.542892 -0.100838 0.071673
9997 -0.238097 0.904069 -0.344859 -0.687887 -0.103900 0.223052 -0.286633
9998 0.453686 -0.375173 0.899238 0.908135 -0.590206 0.524051 0.347251
9999 -0.776015 -0.772112 0.012110 -0.898067 0.182405 -0.500491 0.557853
x8 x9 x10
0 -0.499206 -0.750112 -0.910640
1 -0.652901 0.438065 0.830206
2 -0.183506 -0.783842 -0.729929
3 -0.269405 -0.974268 -0.800515
4 0.800389 0.185542 0.183614
... ... ... ...
9995 -0.470127 -0.247682 -0.552526
9996 0.592903 -0.577123 -0.811461
9997 -0.172245 0.713149 -0.184585
9998 -0.558997 0.610076 -0.862191
9999 -0.275658 -0.250420 0.518420
[10000 rows x 10 columns]
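Sometimes we need the IV values. According to GitHub issue secretflow/secretflow#565, releasing the IVs of individual bins may leak label information. For now, we keep the bin IV values on the label holder's device. The label holder can choose to:
not share any IV information
share selected IV information
We will demonstrate how to share the feature IVs.
Recall that the WOE rules (bin_rules above) are a dict {PYU: PYUObject}, where each PYUObject is itself a dict of the following form: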
{
    "variables": [
        {
            "name": str,  # feature name
            "type": str,  # "string" or "numeric": whether the feature is discrete or continuous
            "categories": list[str],  # categories of a discrete feature
            "split_points": list[float],  # left-open, right-closed split points
            "total_counts": list[int],  # total sample count in each bin
            "else_counts": int,  # count of np.nan samples
            "filling_values": list[float],  # WOE value for each bin
            "else_filling_value": float,  # WOE value for np.nan samples
        },
        # ... other features
    ],
    # label holder's PYUObject only
    # warning: giving bin IVs to another party will leak positive samples in each bin.
    # it is up to the label holder whether to give feature IVs, bin IVs, or all info to the workers.
    # for more information, see: https://github.com/secretflow/secretflow/issues/565
    # in the comments below, "safe" means that label distribution info is not leaked.
    "feature_iv_info": [
        {
            "name": str,  # feature name
            "ivs": list[float],  # IV value for each bin; not safe to share with workers in any case
            "else_iv": float,  # IV for nan values; may be shared with workers
            "feature_iv": float,  # sum of bin IVs; safe to share with workers when bin num > 2
        }
    ]
}
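To make the numeric fields concrete, here is a minimal sketch of how one such rule could be applied to raw values with NumPy. It is not the implementation of `VertBinSubstitution`. The rule contents are made up, and the sketch assumes that `filling_values` holds one entry per bin, that is, `len(split_points) + 1` entries.

```python
import numpy as np

# Hypothetical rule for one numeric feature, shaped like an entry of "variables".
rule = {
    "name": "x5",
    "type": "numeric",
    "split_points": [-0.5, 0.0, 0.5],
    "filling_values": [-0.6, -0.1, 0.2, 0.5],  # assumed: one WOE value per bin
    "else_filling_value": 0.0,
}


def substitute(values, rule):
    """Replace each raw value with the WOE value of the bin it falls into."""
    values = np.asarray(values, dtype=float)
    # side="left" yields left-open, right-closed bins: (s[i-1], s[i]].
    idx = np.searchsorted(rule["split_points"], values, side="left")
    out = np.asarray(rule["filling_values"])[idx]
    # NaN samples get the dedicated "else" value instead of a bin value.
    return np.where(np.isnan(values), rule["else_filling_value"], out)


result = substitute([-0.7, -0.5, 0.3, 0.9, np.nan], rule)
print(result)
```

Here -0.5 lands in the first bin because the bins are right-closed, and the NaN sample receives `else_filling_value`.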
[6]:
# alice is label holder
dict_pyu_object = bin_rules[alice]
def extract_name_and_feature_iv(list_of_feature_iv_info):
    return [(d["name"], d["feature_iv"]) for d in list_of_feature_iv_info]


feature_ivs = alice(
    lambda dict_pyu_object: extract_name_and_feature_iv(
        dict_pyu_object["feature_iv_info"]
    )
)(dict_pyu_object)
[7]:
# we can give the feature_ivs to bob
feature_ivs.to(bob)
# and/or we can reveal it to see it
sf.reveal(feature_ivs)
[7]:
[('x14', 0.43219177635839423), ('x5', 0.37848298069087766), ('x7', 0)]
[8]:
feature_iv_info = sf.reveal(feature_ivs)
df = pd.DataFrame.from_records(feature_iv_info, columns=['feature', 'iv'])
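How should feature IVs be interpreted?
Less than 0.02 -> useless for prediction
0.02 to 0.1 -> weak predictive power
0.1 to 0.3 -> medium predictive power
0.3 to 0.5 -> strong predictive power
Greater than 0.5 -> suspicious predictive power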
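The rule-of-thumb thresholds above can be written as a small helper. The function name `interpret_iv` is hypothetical, and the handling of values that sit exactly on a boundary is an assumption, since the list does not specify it. The input values are the feature IVs revealed earlier in this tutorial.

```python
def interpret_iv(iv):
    """Map a feature IV to a rule-of-thumb description of its predictive power."""
    if iv < 0.02:
        return "useless for prediction"
    if iv < 0.1:
        return "weak"
    if iv < 0.3:
        return "medium"
    if iv <= 0.5:
        return "strong"
    return "suspicious"


# Feature IVs as revealed above.
feature_iv_list = [
    ("x14", 0.43219177635839423),
    ("x5", 0.37848298069087766),
    ("x7", 0),
]
labels = {name: interpret_iv(iv) for name, iv in feature_iv_list}
for name, label in labels.items():
    print(name, "->", label)
```

With these values, x14 and x5 fall in the strong range and x7 is classified as useless for prediction.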
Let us select the two features with the highest IV.
[9]:
print(df.sort_values('iv', ascending=False).head(2))
feature iv
0 x14 0.432192
1 x5 0.378483
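Congratulations! In this tutorial, we learned how to
perform WOE encoding
share some IV information with other parties
use feature IVs for feature selection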