SecretFlow Component List#

Last update: Sat Oct 14 16:41:07 2023

Version: 0.0.1

Official SecretFlow components.

feature#

vert_bin_substitution#

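Component version: 0.0.1

Substitute dataset values according to binning rules.

Inputs#

Name	Description	Type	Notes
input_data	Vertically partitioned dataset to be substituted.	['sf.table.vertical_table']
bin_rule	Input binning rule.	['sf.rule.binning']

Outputs#

Name	Description	Type	Notes
output_data	Output vertical table.	['sf.table.vertical_table']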

vert_binning#

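Component version: 0.0.1

Generate binning rules for a vertically partitioned dataset.

Parameters#

Name	Description	Type	Required	Notes
binning_method	How to bin numeric features: 'quantile' (equal frequency) or 'eq_range' (equal range).	String	N	Default: eq_range. Allowed: ['eq_range', 'quantile'].
bin_num	Maximum number of bins for one feature.	Integer	N	Default: 10. Range: (0, $\infty$).

Inputs#

Name	Description	Type	Notes
input_data	Input vertical table.	['sf.table.vertical_table']	Extra table attributes. (0) feature_selects - which features should be binned. Minimum number of columns to select (inclusive): 1.

Outputs#

Name	Description	Type	Notes
bin_rule	Output binning rule.	['sf.rule.binning']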
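The two `binning_method` options can be illustrated on a single plaintext column. The following is a minimal sketch, not the component's implementation: the helper names `eq_range_edges`, `quantile_edges` and `assign_bin` are hypothetical, and the exact cut-point and tie-handling rules used by SecretFlow are not stated in this document.

```python
def eq_range_edges(values, bin_num):
    """Interior cut points that split [min, max] into bin_num equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bin_num
    return [lo + width * i for i in range(1, bin_num)]


def quantile_edges(values, bin_num):
    """Interior cut points chosen so each bin holds roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    # Take the value at each i/bin_num position. Duplicates are dropped,
    # so heavily repeated values can yield fewer than bin_num bins.
    cuts = [ordered[(n * i) // bin_num] for i in range(1, bin_num)]
    return sorted(set(cuts))


def assign_bin(value, edges):
    """Index of the bin a value falls into (a value equal to an edge goes right)."""
    return sum(value >= edge for edge in edges)


# One outlier (100) dominates the equal-range bins but not the quantile bins.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
range_edges = eq_range_edges(data, 4)
freq_edges = quantile_edges(data, 4)
range_bins = [assign_bin(v, range_edges) for v in data]
freq_bins = [assign_bin(v, freq_edges) for v in data]
```

With equal-range bins, nine of the ten values land in the first bin. The quantile bins stay balanced, which is usually the reason to prefer `quantile` for skewed features.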

vert_woe_binning#

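Component version: 0.0.1

Generate Weight of Evidence (WOE) binning rules for a vertically partitioned dataset.

Parameters#

Name	Description	Type	Required	Notes
secure_device_type	Use SPU (secure multi-party computation, MPC) or HEU (homomorphic encryption, HE) to protect the bucket sums.	String	N	Default: spu. Allowed: ['spu', 'heu'].
binning_method	How to bin numeric features: 'quantile' (equal frequency) or 'chimerge' (the ChiMerge algorithm from AAAI92-019: https://www.aaai.org/Papers/AAAI/1992/AAAI92-019.pdf ).	String	N	Default: quantile. Allowed: ['quantile', 'chimerge'].
bin_num	Maximum number of bins for one feature.	Integer	N	Default: 10. Range: (0, $\infty$).
positive_label	Label value that denotes a positive sample.	String	N	Default: 1.
chimerge_init_bins	Maximum number of bins used to initialize ChiMerge.	Integer	N	Default: 100. Range: (2, $\infty$).
chimerge_target_bins	Stop merging once the number of remaining bins is less than or equal to this value.	Integer	N	Default: 10. Range: [2, $\infty$).
chimerge_target_pvalue	Stop merging once the largest p-value among the remaining bins is greater than this value.	Float	N	Default: 0.1. Range: (0.0, 1.0].

Inputs#

Name	Description	Type	Notes
input_data	Input vertical table.	['sf.table.vertical_table']	Extra table attributes. (0) feature_selects - which features should be binned. Minimum number of columns to select (inclusive): 1.

Outputs#

Name	Description	Type	Notes
bin_rule	Output WOE rule.	['sf.rule.binning']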
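As a plaintext illustration of what a WOE value per bin means, the sketch below uses one common convention: WOE = ln(share of positives in the bin / share of negatives in the bin). The function name `woe_per_bin` is hypothetical, and both the sign convention and the absence of smoothing are assumptions; the component computes the bucket sums under SPU or HEU protection rather than in the clear.

```python
import math


def woe_per_bin(bin_ids, labels, positive_label="1"):
    """Weight of Evidence per bin under the ln(pos_share / neg_share) convention.

    Assumes every bin contains at least one positive and one negative sample;
    no smoothing is applied, so an empty class in a bin would raise an error.
    """
    pos, neg = {}, {}
    for b, y in zip(bin_ids, labels):
        target = pos if str(y) == positive_label else neg
        target[b] = target.get(b, 0) + 1
    total_pos = sum(pos.values())
    total_neg = sum(neg.values())
    result = {}
    for b in sorted(set(bin_ids)):
        pos_share = pos.get(b, 0) / total_pos
        neg_share = neg.get(b, 0) / total_neg
        result[b] = math.log(pos_share / neg_share)
    return result


# Bin 0 holds 1 positive and 3 negatives, bin 1 holds 3 positives and 1 negative.
bins = [0, 0, 0, 0, 1, 1, 1, 1]
labels = [1, 0, 0, 0, 1, 1, 1, 0]
woe = woe_per_bin(bins, labels)
```

Under this convention a bin with relatively more positives gets a positive WOE, and the two bins in the example are mirror images of each other.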

ml.eval#

biclassification_eval#

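Component version: 0.0.1

Statistical evaluation of a binary classification model on a dataset.

summary_report: SummaryReport
group_reports: List[GroupReport]
eq_frequent_bin_report: List[EqBinReport]
eq_range_bin_report: List[EqBinReport]
head_report: List[PrReport], reports for fpr = 0.001, 0.005, 0.01, 0.05, 0.1, 0.2

Parameters#

Name	Description	Type	Required	Notes
bucket_size	Number of buckets.	Integer	N	Default: 10. Range: [1, $\infty$).
min_item_cnt_per_bucket	Minimum number of items in each bucket. An error is raised if any bucket does not meet this requirement. For security reasons, this parameter must be at least 5.	Integer	N	Default: 5. Range: [5, $\infty$).

Inputs#

Name	Description	Type	Notes
labels	Input table with labels.	['sf.table.vertical_table', 'sf.table.individual']	Extra table attributes. (0) col - the column name to use in the dataset. If not provided, the dataset's label is used by default. Maximum number of columns to select (inclusive): 1.
predictions	Input table with predictions.	['sf.table.vertical_table', 'sf.table.individual']	Extra table attributes. (0) col - the column name to use in the dataset. If not provided, the dataset's label is used by default. Maximum number of columns to select (inclusive): 1.

Outputs#

Name	Description	Type	Notes
reports	Output report.	['sf.report']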

prediction_bias_eval#

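Component version: 0.0.1

Calculate prediction bias, i.e. the average of the predictions minus the average of the labels.

Parameters#

Name	Description	Type	Required	Notes
bucket_num	Number of buckets.	Integer	N	Default: 10. Range: [1, $\infty$).
min_item_cnt_per_bucket	Minimum number of items in each bucket. An error is raised if any bucket does not meet this requirement. For security reasons, this parameter must be at least 2.	Integer	N	Default: 2. Range: [2, $\infty$).
bucket_method	Bucketing method.	String	N	Default: equal_width. Allowed: ['equal_width', 'equal_frequency'].

Inputs#

Name	Description	Type	Notes
labels	Input table with labels.	['sf.table.vertical_table', 'sf.table.individual']	Extra table attributes. (0) col - the column name to use in the dataset. If not provided, the dataset's label is used by default. Maximum number of columns to select (inclusive): 1.
predictions	Input table with predictions.	['sf.table.vertical_table', 'sf.table.individual']	Extra table attributes. (0) col - the column name to use in the dataset. If not provided, the dataset's label is used by default. Maximum number of columns to select (inclusive): 1.

Outputs#

Name	Description	Type	Notes
result	Output report.	['sf.report']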
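The definition above (average prediction minus average label) can be sketched per bucket in plaintext. This is an illustration under assumptions, not the component's code: the function name `prediction_bias` is hypothetical, and bucketing the rows by predicted value with equal-width buckets is an assumed reading of `bucket_method = equal_width`.

```python
def prediction_bias(labels, predictions, bucket_num=2, min_item_cnt_per_bucket=2):
    """Per-bucket bias: mean(prediction) - mean(label), with equal-width buckets.

    Rows are assigned to buckets by their prediction value (an assumption).
    """
    lo, hi = min(predictions), max(predictions)
    width = (hi - lo) / bucket_num or 1.0  # avoid a zero width for constant input
    buckets = [[] for _ in range(bucket_num)]
    for y, p in zip(labels, predictions):
        idx = min(int((p - lo) / width), bucket_num - 1)
        buckets[idx].append((y, p))
    report = []
    for items in buckets:
        # Mirrors the documented rule: too-small buckets are an error.
        if len(items) < min_item_cnt_per_bucket:
            raise ValueError("bucket has fewer items than min_item_cnt_per_bucket")
        avg_pred = sum(p for _, p in items) / len(items)
        avg_label = sum(y for y, _ in items) / len(items)
        report.append(avg_pred - avg_label)
    return report


labels = [0, 0, 1, 1]
predictions = [0.1, 0.3, 0.7, 0.9]
bias = prediction_bias(labels, predictions)
```

In this example the low bucket over-predicts (positive bias) and the high bucket under-predicts (negative bias) by the same amount.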

ss_pvalue#

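Component version: 0.0.1

Calculate P-values for an LR model trained on a vertically partitioned dataset, using secret sharing. For large datasets (more than 100,000 samples and 200 features), the [Ring size: 128, Fxp: 40] options are recommended for the SPU device.

Inputs#

Name	Description	Type	Notes
model	Input model.	['sf.model.ss_sgd']
input_data	Input vertical table.	['sf.table.vertical_table']

Outputs#

Name	Description	Type	Notes
report	Output P-value report.	['sf.report']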

ml.predict#

sgb_predict#

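Component version: 0.0.1

Predict using an SGB model.

Parameters#

Name	Description	Type	Required	Notes
receiver	Receiver party.	String	Y	Default: (empty).
pred_name	Column name for predictions.	String	N	Default: pred.
save_ids	Whether to save the ids column to the output prediction table. If true, the input feature_dataset must contain an id column, and the receiver party must be the id owner.	Boolean	N	Default: False.
save_label	Whether to save the real label column to the output prediction table. If true, the input feature_dataset must contain a label column, and the receiver party must be the label owner.	Boolean	N	Default: False.

Inputs#

Name	Description	Type	Notes
model	Input model.	['sf.model.sgb']
feature_dataset	Input vertical table.	['sf.table.vertical_table']

Outputs#

Name	Description	Type	Notes
pred	Output predictions.	['sf.table.individual']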

ss_glm_predict#

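Component version: 0.0.1

Predict using an SSGLM model.

Parameters#

Name	Description	Type	Required	Notes
receiver	Receiver party.	String	Y	Default: (empty).
pred_name	Column name for predictions.	String	N	Default: pred.
save_ids	Whether to save the ids column to the output prediction table. If true, the input feature_dataset must contain an id column, and the receiver party must be the id owner.	Boolean	N	Default: False.
save_label	Whether to save the real label column to the output prediction table. If true, the input feature_dataset must contain a label column, and the receiver party must be the label owner.	Boolean	N	Default: False.
offset_col	Specify a column to use as the offset.	String	N	Default: (empty).

Inputs#

Name	Description	Type	Notes
model	Input model.	['sf.model.ss_glm']
feature_dataset	Input vertical table.	['sf.table.vertical_table']

Outputs#

Name	Description	Type	Notes
pred	Output predictions.	['sf.table.individual']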

ss_sgd_predict#

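Component version: 0.0.1

Predict using an SS-SGD model.

Parameters#

Name	Description	Type	Required	Notes
batch_size	Number of training examples used in one iteration.	Integer	N	Default: 1024. Range: (0, $\infty$).
receiver	Receiver party.	String	Y	Default: (empty).
pred_name	Column name for predictions.	String	N	Default: pred.
save_ids	Whether to save the ids column to the output prediction table. If true, the input feature_dataset must contain an id column, and the receiver party must be the id owner.	Boolean	N	Default: False.
save_label	Whether to save the real label column to the output prediction table. If true, the input feature_dataset must contain a label column, and the receiver party must be the label owner.	Boolean	N	Default: False.

Inputs#

Name	Description	Type	Notes
model	Input model.	['sf.model.ss_sgd']
feature_dataset	Input vertical table.	['sf.table.vertical_table']

Outputs#

Name	Description	Type	Notes
pred	Output predictions.	['sf.table.individual']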

ss_xgb_predict#

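Component version: 0.0.1

Predict using an SS-XGB model.

Parameters#

Name	Description	Type	Required	Notes
receiver	Receiver party.	String	Y	Default: (empty).
pred_name	Column name for predictions.	String	N	Default: pred.
save_ids	Whether to save the ids column to the output prediction table. If true, the input feature_dataset must contain an id column, and the receiver party must be the id owner.	Boolean	N	Default: False.
save_label	Whether to save the real label column to the output prediction table. If true, the input feature_dataset must contain a label column, and the receiver party must be the label owner.	Boolean	N	Default: False.

Inputs#

Name	Description	Type	Notes
model	Input model.	['sf.model.ss_xgb']
feature_dataset	Input vertical table.	['sf.table.vertical_table']

Outputs#

Name	Description	Type	Notes
pred	Output predictions.	['sf.table.individual']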

ml.train#

sgb_train#

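Component version: 0.0.1

This method provides both classification and regression tree boosting (also known as GBDT, GBM) for the vertically partitioned dataset setting, using secret sharing.

SGB is short for SecureBoost. Compared with its safer counterpart SS-XGB, SecureBoost focuses on protecting the label holder.
For details, see https://arxiv.org/pdf/2005.08479.pdf .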

Parameters#

Name	Description	Type	Required	Notes
num_boost_round	Number of boosting rounds.	Integer	N	Default: 10. Range: [1, $\infty$).
max_depth	Maximum depth of a tree.	Integer	N	Default: 5. Range: [1, 16].
learning_rate	Step size shrinkage used in updates to prevent overfitting.	Float	N	Default: 0.1. Range: (0.0, 1.0].
objective	Specify the learning objective.	String	N	Default: logistic. Allowed: ['linear', 'logistic'].
reg_lambda	L2 regularization term on weights.	Float	N	Default: 0.1. Range: [0.0, 10000.0].
gamma	A value greater than 0 enables pre-pruning. A node is pruned if its gain is less than this value.	Float	N	Default: 0.1. Range: [0.0, 10000.0].
colsample_by_tree	Subsample ratio of columns when constructing each tree.	Float	N	Default: 1.0. Range: (0.0, 1.0].
sketch_eps	This roughly translates into O(1 / sketch_eps) bins.	Float	N	Default: 0.1. Range: (0.0, 1.0].
base_score	The initial prediction score of all instances (global bias).	Float	N	Default: 0.0. Range: [0.0, $\infty$).
seed	Seed for the pseudorandom number generator.	Integer	N	Default: 42. Range: [0, $\infty$).
fixed_point_parameter	Any floating-point number encoded by heu is multiplied by a scale and rounded, where scale = 2 ** fixed_point_parameter. A larger value may give higher numerical precision, but too large a value causes overflow.	Integer	N	Default: 20. Range: [1, 16].
first_tree_with_label_holder_feature	Whether to train the first tree with the label holder's own features.	Boolean	N	Default: False.
batch_encoding_enabled	Whether to use batch encoding optimization.	Boolean	N	Default: True.
enable_quantization	Whether to enable quantization of g and h.	Boolean	N	Default: False.
quantization_scale	Scale the sum of g to the specified value.	Float	N	Default: 10000. Range: [0.0, 10000000.0].
max_leaf	Maximum number of leaves of a tree. Only effective when training leaf-wise.	Integer	N	Default: 15. Range: [1, 32768].
rowsample_by_tree	Row subsample ratio of the training instances.	Float	N	Default: 1.0. Range: (0.0, 1.0].
enable_goss	Whether to enable GOSS.	Boolean	N	Default: False.
top_rate	GOSS-specific parameter. The fraction of large gradients to sample.	Float	N	Default: 0.3. Range: (0.0, 1.0].
bottom_rate	GOSS-specific parameter. The fraction of small gradients to sample.	Float	N	Default: 0.5. Range: (0.0, 1.0].
early_stop_criterion_g_abs_sum	If sum(abs(g)) is lower than or equal to this threshold, training stops.	Float	N	Default: 0.0. Range: [0.0, $\infty$).
early_stop_criterion_g_abs_sum_change_ratio	If the change ratio of the absolute g sum is lower than or equal to this threshold, training stops.	Float	N	Default: 0.0. Range: [0.0, 1.0].
tree_growing_method	How to grow the tree.	String	N	Default: level.

Inputs#

Name	Description	Type	Notes
train_dataset	Input vertical table.	['sf.table.vertical_table']

Outputs#

Name	Description	Type	Notes
output_model	Output model.	['sf.model.sgb']
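The `fixed_point_parameter` description (multiply by `scale = 2 ** fixed_point_parameter`, then round) can be illustrated with a small plaintext sketch. The helper names `encode` and `decode` are hypothetical, and Python's built-in `round` is only a stand-in for whatever rounding the HEU encoder actually applies.

```python
def encode(value, fixed_point_parameter=20):
    """Map a float to an integer by scaling with 2 ** fixed_point_parameter and rounding."""
    scale = 2 ** fixed_point_parameter
    return round(value * scale)


def decode(encoded, fixed_point_parameter=20):
    """Map the integer back to a float by dividing by the same scale."""
    return encoded / 2 ** fixed_point_parameter


gradient = 0.123456789
# A small parameter keeps the integers small but loses precision.
error_low = abs(decode(encode(gradient, 4), 4) - gradient)
# A larger parameter preserves more digits at the cost of larger integers,
# which is where the overflow risk mentioned above comes from.
error_high = abs(decode(encode(gradient, 20), 20) - gradient)
```

The round-trip error is bounded by half of one step, i.e. `2 ** -(fixed_point_parameter + 1)`, so each extra bit halves the worst-case error while doubling the magnitude of the encoded integers.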

ss_glm_train#

Component version: 0.0.1

A generalized linear model (GLM) is a flexible generalization of ordinary linear regression. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.

Parameters#

Name	Description	Type	Required	Notes
epochs	Number of complete passes over the training data.	Integer	N	Default: 10. Range: [1, $\infty$).
learning_rate	Step size in each iteration.	Float	N	Default: 0.1. Range: (0.0, $\infty$).
batch_size	Number of training examples used in one iteration.	Integer	N	Default: 1024. Range: (0, $\infty$).
link_type	Link function type.	String	Y	Default: None. Allowed: ['Logit', 'Log', 'Reciprocal', 'Indentity'].
label_dist_type	Label distribution type.	String	Y	Default: None. Allowed: ['Bernoulli', 'Poisson', 'Gamma', 'Tweedie'].
tweedie_power	Tweedie distribution power parameter.	Float	N	Default: 1.0. Range: (0.0, 2.0].
dist_scale	An initial guess for the distribution's scale.	Float	N	Default: 1.0. Range: [1.0, $\infty$).
eps	If the change rate of the weights is less than this threshold, the model is considered converged and training stops early. 0 disables this check.	Float	N	Default: 0.0001. Range: (0.0, $\infty$).
iter_start_irls	Run a few rounds of IRLS training to initialize w. 0 disables this.	Integer	N	Default: 0. Range: [0, $\infty$).
decay_epoch	Decay learning interval.	Integer	N	Default: 0. Range: [0, $\infty$).
decay_rate	Decay learning rate.	Float	N	Default: 0.0. Range: [0.0, 1.0].
optimizer	Which optimizer to use: IRLS (Iteratively Reweighted Least Squares) or SGD (Stochastic Gradient Descent).	String	Y	Allowed: ['SGD', 'IRLS'].
offset_col	Specify a column to use as the offset.	String	N	Default: (empty).
weight_col	Specify a column to use for the observation weights.	String	N	Default: (empty).

Inputs#

Name	Description	Type	Notes
train_dataset	Input vertical table.	['sf.table.vertical_table']

Outputs#

Name	Description	Type	Notes
output_model	Output model.	['sf.model.ss_glm']
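To make the role of the link function concrete, the sketch below pairs each allowed link name with its textbook definition and inverse, then maps a linear predictor to a predicted mean. This is a plaintext illustration only: the `LINKS` table and `predict_mean` are hypothetical names, the formulas are the standard GLM definitions rather than code taken from SecretFlow, and adding the offset to the linear predictor is the usual GLM convention, assumed here.

```python
import math

# (link, inverse link) pairs using the textbook definitions.
# The key 'Indentity' is spelled as in the parameter table above.
LINKS = {
    "Logit": (lambda mu: math.log(mu / (1 - mu)), lambda eta: 1 / (1 + math.exp(-eta))),
    "Log": (math.log, math.exp),
    "Reciprocal": (lambda mu: 1 / mu, lambda eta: 1 / eta),
    "Indentity": (lambda mu: mu, lambda eta: eta),
}


def predict_mean(weights, bias, features, link_type, offset=0.0):
    """Compute eta = bias + offset + w.x, then apply the inverse link to get the mean."""
    eta = bias + offset + sum(w * x for w, x in zip(weights, features))
    _, inverse_link = LINKS[link_type]
    return inverse_link(eta)


# With a zero linear predictor, Logit gives probability 0.5 and Log gives mean 1.
p_logit = predict_mean([0.0], 0.0, [1.0], "Logit")
mu_log = predict_mean([0.0], 0.0, [1.0], "Log")
```

A common pairing is Logit with a Bernoulli label distribution and Log with Poisson, Gamma or Tweedie, although the table above does not state which combinations the component accepts.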

ss_sgd_train#

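Component version: 0.0.1

Train linear and logistic regression models on a vertically partitioned dataset with a mini-batch SGD training solver, using secret sharing.

SS-SGD is short for secret sharing SGD training.

Parameters#

Name	Description	Type	Required	Notes
epochs	Number of complete passes over the training data.	Integer	N	Default: 10. Range: [1, $\infty$).
learning_rate	Step size in each iteration.	Float	N	Default: 0.1. Range: (0.0, $\infty$).
batch_size	Number of training examples used in one iteration.	Integer	N	Default: 1024. Range: (0, $\infty$).
sig_type	Sigmoid approximation type.	String	N	Default: t1. Allowed: ['real', 't1', 't3', 't5', 'df', 'sr', 'mix'].
reg_type	Regression type.	String	N	Default: logistic. Allowed: ['linear', 'logistic'].
penalty	The penalty (also known as the regularization term) to use.	String	N	Default: None. Allowed: ['None', 'l1', 'l2'].
l2_norm	L2 regularization term.	Float	N	Default: 0.5. Range: [0.0, $\infty$).
eps	If the change rate of the weights is less than this threshold, the model is considered converged and training stops early. 0 disables this check.	Float	N	Default: 0.001. Range: (0.0, $\infty$).

Inputs#

Name	Description	Type	Notes
train_dataset	Input vertical table.	['sf.table.vertical_table']

Outputs#

Name	Description	Type	Notes
output_model	Output model.	['sf.model.ss_sgd']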
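Sigmoid approximations matter because the exact sigmoid is expensive to evaluate under secret sharing. The sketch below compares the exact function with Taylor expansions around 0 of order 1, 3 and 5. It assumes that `t1`, `t3` and `t5` denote such Taylor polynomials, which this document does not confirm, and it does not cover the `df`, `sr` or `mix` options. The function names are hypothetical.

```python
import math


def sigmoid_real(x):
    """Exact logistic sigmoid."""
    return 1 / (1 + math.exp(-x))


def sigmoid_taylor(x, order):
    """Taylor polynomial of the sigmoid around 0, truncated at the given odd order."""
    # sigmoid(x) = 1/2 + x/4 - x^3/48 + x^5/480 - ...
    terms = {1: x / 4, 3: -(x ** 3) / 48, 5: (x ** 5) / 480}
    return 0.5 + sum(value for power, value in terms.items() if power <= order)


x = 0.5
errors = {order: abs(sigmoid_taylor(x, order) - sigmoid_real(x)) for order in (1, 3, 5)}
```

Near zero, each added term shrinks the error considerably. Far from zero the polynomials diverge from the true sigmoid, so a practical implementation would need to clip or otherwise bound the output; that detail is left out of this sketch.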

ss_xgb_train#

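Component version: 0.0.1

This method provides both classification and regression tree boosting (also known as GBDT, GBM) for the vertically partitioned dataset setting, using secret sharing.

SS-XGB is short for secret sharing XGB.
For details, see https://arxiv.org/pdf/2005.08479.pdf .

Parameters#

Name	Description	Type	Required	Notes
num_boost_round	Number of boosting rounds.	Integer	N	Default: 10. Range: [1, $\infty$).
max_depth	Maximum depth of a tree.	Integer	N	Default: 5. Range: [1, 16].
learning_rate	Step size shrinkage used in updates to prevent overfitting.	Float	N	Default: 0.1. Range: (0.0, 1.0].
objective	Specify the learning objective.	String	N	Default: logistic. Allowed: ['linear', 'logistic'].
reg_lambda	L2 regularization term on weights.	Float	N	Default: 0.1. Range: [0.0, 10000.0].
subsample	Subsample ratio of the training instances.	Float	N	Default: 0.1. Range: (0.0, 1.0].
colsample_by_tree	Subsample ratio of columns when constructing each tree.	Float	N	Default: 0.1. Range: (0.0, 1.0].
sketch_eps	This roughly translates into O(1 / sketch_eps) bins.	Float	N	Default: 0.1. Range: (0.0, 1.0].
base_score	The initial prediction score of all instances (global bias).	Float	N	Default: 0.0. Range: [0.0, $\infty$).
seed	Seed for the pseudorandom number generator.	Integer	N	Default: 42. Range: [0, $\infty$).

Inputs#

Name	Description	Type	Notes
train_dataset	Input vertical table.	['sf.table.vertical_table']

Outputs#

Name	Description	Type	Notes
output_model	Output model.	['sf.model.ss_xgb']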

preprocessing#

feature_filter#

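Component version: 0.0.1

Drop features from the dataset.

Inputs#

Name	Description	Type	Notes
in_ds	Input vertical table.	['sf.table.vertical_table']	Extra table attributes. (0) drop_features - features to drop.

Outputs#

Name	Description	Type	Notes
out_ds	Output vertical table.	['sf.table.vertical_table']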

psi#

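Component version: 0.0.1

PSI (Private Set Intersection) between two parties.

Parameters#

Name	Description	Type	Required	Notes
protocol	PSI protocol.	String	N	Default: ECDH_PSI_2PC. Allowed: ['ECDH_PSI_2PC', 'KKRT_PSI_2PC', 'BC22_PSI_2PC'].
sort	Whether to sort the output.	Boolean	N	Default: False.
bucket_size	Hash bucket size used in PSI. Larger values consume more memory.	Integer	N	Default: 1048576. Range: (0, $\infty$).
ecdh_curve_type	Curve type for ECDH PSI.	String	N	Default: CURVE_FOURQ. Allowed: ['CURVE_25519', 'CURVE_FOURQ', 'CURVE_SM2', 'CURVE_SECP256K1'].

Inputs#

Name	Description	Type	Notes
receiver_input	Individual table for the receiver.	['sf.table.individual']	Extra table attributes. (0) key - column(s) used to join. If not provided, the dataset's ids are used.
sender_input	Individual table for the sender.	['sf.table.individual']	Extra table attributes. (0) key - column(s) used to join. If not provided, the dataset's ids are used.

Outputs#

Name	Description	Type	Notes
psi_output	Output vertical table.	['sf.table.vertical_table']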
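The idea behind Diffie-Hellman style PSI can be shown with a toy, single-process sketch. It swaps elliptic curves for modular exponentiation over a small prime field, so it is not the `ECDH_PSI_2PC` protocol itself and is not secure. All names (`P`, `hash_to_group`, `mask`, `toy_dh_psi`) are hypothetical, and in a real deployment each party would keep its secret and raw keys private and only exchange masked values.

```python
import hashlib

# A Mersenne prime used as the modulus. Chosen for brevity, NOT for security.
P = 2 ** 127 - 1


def hash_to_group(item):
    """Hash a join key to an integer modulo P."""
    digest = hashlib.sha256(str(item).encode()).digest()
    return int.from_bytes(digest, "big") % P


def mask(elements, secret):
    """Raise every element to a secret exponent; this operation commutes."""
    return [pow(e, secret, P) for e in elements]


def toy_dh_psi(receiver_keys, sender_keys, receiver_secret, sender_secret):
    """Return the receiver keys that also appear in the sender's set."""
    # Each party masks its own hashed keys with its own secret.
    recv_once = mask([hash_to_group(k) for k in receiver_keys], receiver_secret)
    send_once = mask([hash_to_group(k) for k in sender_keys], sender_secret)
    # The masked values are exchanged and masked again by the other party.
    recv_twice = mask(recv_once, sender_secret)          # done by the sender
    send_twice = set(mask(send_once, receiver_secret))   # done by the receiver
    # H(k)^(a*b) == H(k)^(b*a), so double-masked values match for shared keys.
    return [k for k, m in zip(receiver_keys, recv_twice) if m in send_twice]


shared = toy_dh_psi(
    ["alice", "bob", "carol"],
    ["bob", "dave", "carol"],
    receiver_secret=0x1F2E3D4C5B6A7988,
    sender_secret=0x123456789ABCDEF1,
)
```

Only the keys present on both sides survive, which corresponds to the rows that end up in the joined `psi_output` table.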

train_test_split#

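Component version: 0.0.1

Split a dataset into random train and test subsets.

See https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

Parameters#

Name	Description	Type	Required	Notes
train_size	Proportion of the dataset to include in the train subset.	Float	N	Default: 0.75. Range: [0.0, 1.0].
test_size	Proportion of the dataset to include in the test subset.	Float	N	Default: 0.25. Range: [0.0, 1.0].
random_state	Random seed.	Integer	N	Default: 1024. Range: (0, $\infty$).
shuffle	Whether to shuffle the data before splitting.	Boolean	N	Default: True.

Inputs#

Name	Description	Type	Notes
input_data	Input vertical table.	['sf.table.vertical_table']

Outputs#

Name	Description	Type	Notes
train	Output train dataset.	['sf.table.vertical_table']
test	Output test dataset.	['sf.table.vertical_table']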
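For a vertical table, each party holds different columns of the same rows, so a split only makes sense if every party keeps the same row indices. The sketch below illustrates that with the standard library; it is not the component's implementation. `split_indices` is a hypothetical helper, the test set is simply the complement of the train set (so `test_size` is not modelled), and the rounding rule is an assumption that may differ from scikit-learn's.

```python
import random


def split_indices(n_rows, train_size=0.75, random_state=1024, shuffle=True):
    """Return (train_indices, test_indices) for a table with n_rows rows."""
    indices = list(range(n_rows))
    if shuffle:
        # A seeded generator makes the split reproducible across parties.
        random.Random(random_state).shuffle(indices)
    n_train = int(n_rows * train_size)
    return indices[:n_train], indices[n_train:]


# Two parties hold different columns for the same 8 aligned rows.
alice_features = [[i, i * 2] for i in range(8)]
bob_labels = [i % 2 for i in range(8)]

train_idx, test_idx = split_indices(len(alice_features))
alice_train = [alice_features[i] for i in train_idx]
bob_train = [bob_labels[i] for i in train_idx]
alice_test = [alice_features[i] for i in test_idx]
bob_test = [bob_labels[i] for i in test_idx]
```

Because both parties apply the same index lists, row `k` of `alice_train` still corresponds to row `k` of `bob_train` after the split.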

stats#

ss_pearsonr#

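Component version: 0.0.1

Calculate the Pearson product-moment correlation coefficient for a vertically partitioned dataset, using secret sharing.

For large datasets (more than 100,000 samples and 200 features), the [Ring size: 128, Fxp: 40] options are recommended for the SPU device.

Inputs#

Name	Description	Type	Notes
input_data	Input vertical table.	['sf.table.vertical_table']	Extra table attributes. (0) feature_selects - the features for which to compute correlation coefficients. If empty, all features are used.

Outputs#

Name	Description	Type	Notes
report	Output Pearson correlation matrix report.	['sf.report']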
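For reference, the quantity being computed is the standard Pearson coefficient, shown here in plaintext for a single pair of columns. The function name `pearson_r` is hypothetical; the component evaluates the same statistic under secret sharing, and how it does so internally is not described in this document.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient of two equal-length columns."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


r_perfect = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
r_inverse = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
r_partial = pearson_r([1, 2, 3, 4], [1, 3, 2, 4])
```

Applying this to every pair of selected features yields the symmetric matrix that the report contains, with ones on the diagonal.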

ss_vif#

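Component version: 0.0.1

Calculate the Variance Inflation Factor (VIF) for a vertically partitioned dataset, using secret sharing.

For large datasets (more than 100,000 samples and 200 features), the [Ring size: 128, Fxp: 40] options are recommended for the SPU device.

Inputs#

Name	Description	Type	Notes
input_data	Input vertical table.	['sf.table.vertical_table']	Extra table attributes. (0) feature_selects - the features for which to compute VIF. If empty, all features are used.

Outputs#

Name	Description	Type	Notes
report	Output Variance Inflation Factor (VIF) report.	['sf.report']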
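One standard way to obtain VIF values is from the feature correlation matrix: the VIF of feature j equals the j-th diagonal entry of the inverse correlation matrix, which is the same as 1 / (1 - R_j^2). The sketch below does this in plaintext with a small Gauss-Jordan inversion. The function names are hypothetical, and whether the component uses this route internally is an assumption.

```python
def invert(matrix):
    """Invert a small square matrix with Gauss-Jordan elimination and partial pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity: [A | I].
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    # The right half now holds the inverse.
    return [row[n:] for row in aug]


def vif_from_correlation(corr):
    """VIF per feature: the diagonal of the inverse correlation matrix."""
    inverse = invert(corr)
    return [inverse[i][i] for i in range(len(corr))]


# Two features with correlation 0.8: VIF = 1 / (1 - 0.8^2).
vif_correlated = vif_from_correlation([[1.0, 0.8], [0.8, 1.0]])
# Uncorrelated features have VIF 1.
vif_independent = vif_from_correlation([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
```

A VIF close to 1 indicates little collinearity. The value grows without bound as a feature becomes a linear combination of the others, at which point the correlation matrix is singular and this inversion fails.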

table_statistics#

Component version: 0.0.1

Get a table of statistics, including, for each column:

datatype
total_count
count
count_na
min
max
var
std
sem
skewness
kurtosis
q1
q2
q3
moment_2
moment_3
moment_4
central_moment_2
central_moment_3
central_moment_4
sum
sum_2
sum_3
sum_4

moment_2 means E[X^2].
central_moment_2 means E[(X - mean(X))^2].
sum_2 means sum(X^2).
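The three definitions above differ only in normalization and centering, which a short plaintext sketch makes concrete. The function name `second_order_stats` is hypothetical, and the population form (dividing by n) follows directly from reading E[...] as a plain average.

```python
def second_order_stats(xs):
    """Compute sum_2, moment_2 and central_moment_2 for one column."""
    n = len(xs)
    mean = sum(xs) / n
    sum_2 = sum(x ** 2 for x in xs)
    return {
        "sum_2": sum_2,                                              # sum(X^2)
        "moment_2": sum_2 / n,                                       # E[X^2]
        "central_moment_2": sum((x - mean) ** 2 for x in xs) / n,    # E[(X - mean)^2]
    }


stats = second_order_stats([1, 2, 3, 4])
```

The values are linked by the identity central_moment_2 = moment_2 - mean(X)^2, and the higher-order entries (moment_3, central_moment_3, sum_3 and so on) follow the same pattern with larger exponents.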

Inputs#

Name	Description	Type	Notes
input_data	Input table.	['sf.table.vertical_table', 'sf.table.individual']

Outputs#

Name	Description	Type	Notes
report	Output table statistics report.	['sf.report']