SecretFlow Component List#

Last update: Sat Oct 14 16:41:07 2023

Version: 0.0.1

First-party SecretFlow components.

feature#

vert_bin_substitution#

Component version: 0.0.1

Substitute dataset values according to bin substitution rules.

Inputs#

Name	Description	Type(s)	Notes
input_data	Vertical partitioning dataset to be substituted.	[‘sf.table.vertical_table’]
bin_rule	Input bin substitution rule.	[‘sf.rule.binning’]

Outputs#

Name	Description	Type(s)	Notes
output_data	Output vertical table.	[‘sf.table.vertical_table’]

vert_binning#

Component version: 0.0.1

Generate equal frequency or equal range binning rules for vertical partitioning datasets.

Attrs#

Name	Description	Type	Required	Notes
binning_method	How to bin features with numeric types: “quantile” (equal frequency) or “eq_range” (equal range).	String	N	Default: eq_range. Allowed: [‘eq_range’, ‘quantile’].
bin_num	Maximum number of bins for one feature.	Integer	N	Default: 10. Range: (0, $\infty$).

Inputs#

Name	Description	Type(s)	Notes
input_data	Input vertical table.	[‘sf.table.vertical_table’]	Extra table attributes.(0) feature_selects - which features should be binned. Min column number to select(inclusive): 1.

Outputs#

Name	Description	Type(s)	Notes
bin_rule	Output bin rule.	[‘sf.rule.binning’]
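
The difference between the two binning_method options can be illustrated in plaintext. This is a minimal sketch using only the Python standard library; the helper names `eq_range_edges` and `quantile_edges` are hypothetical and are not part of the component, which operates on vertically partitioned data rather than a local list.

```python
from statistics import quantiles


def eq_range_edges(values, bin_num):
    """Interior cut points that split [min, max] into bin_num equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bin_num
    return [lo + width * i for i in range(1, bin_num)]


def quantile_edges(values, bin_num):
    """Interior cut points that put roughly the same number of samples in each bin."""
    # Duplicate cut points are merged, so heavily repeated values can yield
    # fewer bins than requested (bin_num is described as a maximum).
    return sorted(set(quantiles(values, n=bin_num, method="inclusive")))


values = [1, 2, 2, 3, 4, 5, 8, 13, 21, 100]
print("eq_range:", eq_range_edges(values, 4))
print("quantile:", quantile_edges(values, 4))
```

On skewed data such as this, the equal-range cut points are dominated by the outlier, while the quantile cut points follow the bulk of the samples.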

vert_woe_binning#

Component version: 0.0.1

Generate Weight of Evidence (WOE) binning rules for vertical partitioning datasets.

Attrs#

Name	Description	Type	Required	Notes
secure_device_type	Use SPU(Secure multi-party computation or MPC) or HEU(Homomorphic encryption or HE) to secure bucket summation.	String	N	Default: spu. Allowed: [‘spu’, ‘heu’].
binning_method	How to bin features with numeric types: “quantile” (equal frequency) or “chimerge” (ChiMerge from AAAI92-019: https://www.aaai.org/Papers/AAAI/1992/AAAI92-019.pdf).	String	N	Default: quantile. Allowed: [‘quantile’, ‘chimerge’].
bin_num	Maximum number of bins for one feature.	Integer	N	Default: 10. Range: (0, $\infty$).
positive_label	Which label value represents the positive class.	String	N	Default: 1.
chimerge_init_bins	Max bin counts for initialization binning in ChiMerge.	Integer	N	Default: 100. Range: (2, $\infty$).
chimerge_target_bins	Stop merging if remaining bin counts is less than or equal to this value.	Integer	N	Default: 10. Range: [2, $\infty$).
chimerge_target_pvalue	Stop merging if biggest pvalue of remaining bins is greater than this value.	Float	N	Default: 0.1. Range: (0.0, 1.0].

Inputs#

Name	Description	Type(s)	Notes
input_data	Input vertical table.	[‘sf.table.vertical_table’]	Extra table attributes.(0) feature_selects - which features should be binned. Min column number to select(inclusive): 1.

Outputs#

Name	Description	Type(s)	Notes
bin_rule	Output WOE rule.	[‘sf.rule.binning’]
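
As a plaintext reference for what a WOE rule encodes, the sketch below computes a Weight of Evidence value per bin. It assumes the common convention WOE = ln(share of positives in the bin / share of negatives in the bin); the sign convention used by the component is not stated here, and `woe_per_bin` is a hypothetical helper. The component itself performs the bucket summation under SPU or HEU protection instead of on local data.

```python
import math


def woe_per_bin(bin_ids, labels, positive_label="1"):
    """Return {bin_id: WOE} for already-binned samples.

    Assumes every bin contains at least one positive and one negative sample;
    real implementations need a rule for empty counts.
    """
    pos, neg = {}, {}
    for b, y in zip(bin_ids, labels):
        counts = pos if str(y) == positive_label else neg
        counts[b] = counts.get(b, 0) + 1
    pos_total, neg_total = sum(pos.values()), sum(neg.values())
    woe = {}
    for b in sorted(set(bin_ids)):
        pos_share = pos[b] / pos_total
        neg_share = neg[b] / neg_total
        woe[b] = math.log(pos_share / neg_share)
    return woe


bin_ids = [0, 0, 0, 0, 1, 1, 1, 1]
labels = [1, 0, 0, 0, 1, 1, 1, 0]
print(woe_per_bin(bin_ids, labels))
```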

ml.eval#

biclassification_eval#

Component version: 0.0.1

Statistics evaluation for a bi-classification model on a dataset.

summary_report: SummaryReport
group_reports: List[GroupReport]
eq_frequent_bin_report: List[EqBinReport]
eq_range_bin_report: List[EqBinReport]
head_report: List[PrReport] reports for fpr = 0.001, 0.005, 0.01, 0.05, 0.1, 0.2

Attrs#

Name	Description	Type	Required	Notes
bucket_size	Number of buckets.	Integer	N	Default: 10. Range: [1, $\infty$).
min_item_cnt_per_bucket	Minimum item count per bucket. If any bucket does not meet the requirement, an error is raised. For security reasons, this parameter must be at least 5.	Integer	N	Default: 5. Range: [5, $\infty$).

Inputs#

Name	Description	Type(s)	Notes
labels	Input table with labels	[‘sf.table.vertical_table’, ‘sf.table.individual’]	Extra table attributes.(0) col - The column name to use in the dataset. If not provided, the label of dataset will be used by default. Max column number to select(inclusive): 1.
predictions	Input table with predictions	[‘sf.table.vertical_table’, ‘sf.table.individual’]	Extra table attributes.(0) col - The column name to use in the dataset. If not provided, the label of dataset will be used by default. Max column number to select(inclusive): 1.

Outputs#

Name	Description	Type(s)	Notes
reports	Output report.	[‘sf.report’]

prediction_bias_eval#

Component version: 0.0.1

Calculate prediction bias, i.e., the average of predictions minus the average of labels.

Attrs#

Name	Description	Type	Required	Notes
bucket_num	Number of buckets.	Integer	N	Default: 10. Range: [1, $\infty$).
min_item_cnt_per_bucket	Minimum item count per bucket. If any bucket does not meet the requirement, an error is raised. For security reasons, this parameter must be at least 2.	Integer	N	Default: 2. Range: [2, $\infty$).
bucket_method	Bucket method.	String	N	Default: equal_width. Allowed: [‘equal_width’, ‘equal_frequency’].

Inputs#

Name	Description	Type(s)	Notes
labels	Input table with labels.	[‘sf.table.vertical_table’, ‘sf.table.individual’]	Extra table attributes.(0) col - The column name to use in the dataset. If not provided, the label of dataset will be used by default. Max column number to select(inclusive): 1.
predictions	Input table with predictions.	[‘sf.table.vertical_table’, ‘sf.table.individual’]	Extra table attributes.(0) col - The column name to use in the dataset. If not provided, the label of dataset will be used by default. Max column number to select(inclusive): 1.

Outputs#

Name	Description	Type(s)	Notes
result	Output report.	[‘sf.report’]
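
The bias definition above can be written out directly. The following is a minimal plaintext sketch; bucketing samples by prediction score with equal-width buckets is an assumption made for illustration, and `prediction_bias` and `bucketed_bias` are hypothetical names rather than the component's API.

```python
def prediction_bias(labels, predictions):
    """Average of predictions minus average of labels."""
    return sum(predictions) / len(predictions) - sum(labels) / len(labels)


def bucketed_bias(labels, predictions, bucket_num):
    """Bias per equal-width bucket of the prediction score (assumed bucketing key)."""
    lo, hi = min(predictions), max(predictions)
    width = (hi - lo) / bucket_num or 1.0  # avoid a zero width when all scores are equal
    buckets = [[] for _ in range(bucket_num)]
    for y, p in zip(labels, predictions):
        idx = min(int((p - lo) / width), bucket_num - 1)
        buckets[idx].append((y, p))
    return [
        prediction_bias([y for y, _ in b], [p for _, p in b]) if b else None
        for b in buckets
    ]


labels = [0, 0, 1, 1]
predictions = [0.1, 0.4, 0.6, 0.9]
print(prediction_bias(labels, predictions))
print(bucketed_bias(labels, predictions, bucket_num=2))
```

A model can have near-zero overall bias while individual buckets are biased in opposite directions, which is why a per-bucket view is useful.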

ss_pvalue#

Component version: 0.0.1

Calculate P-Value for LR model training on vertical partitioning dataset by using secret sharing. For large datasets (more than 100,000 samples and 200 features), the [Ring size: 128, Fxp: 40] options are recommended for the SPU device.

Inputs#

Name	Description	Type(s)	Notes
model	Input model.	[‘sf.model.ss_sgd’]
input_data	Input vertical table.	[‘sf.table.vertical_table’]

Outputs#

Name	Description	Type(s)	Notes
report	Output P-Value report.	[‘sf.report’]

ml.predict#

sgb_predict#

Component version: 0.0.1

Predict using SGB model.

Attrs#

Name	Description	Type	Required	Notes
receiver	Party of receiver.	String	Y	Default: .
pred_name	Name for prediction column	String	N	Default: pred.
save_ids	Whether to save ids columns into output prediction table. If true, input feature_dataset must contain id columns, and receiver party must be id owner.	Boolean	N	Default: False.
save_label	Whether or not to save real label columns into output pred file. If true, input feature_dataset must contain label columns and receiver party must be label owner.	Boolean	N	Default: False.

Inputs#

Name	Description	Type(s)	Notes
model	Input model.	[‘sf.model.sgb’]
feature_dataset	Input vertical table.	[‘sf.table.vertical_table’]

Outputs#

Name	Description	Type(s)	Notes
pred	Output prediction.	[‘sf.table.individual’]

ss_glm_predict#

Component version: 0.0.1

Predict using the SSGLM model.

Attrs#

Name	Description	Type	Required	Notes
receiver	Party of receiver.	String	Y	Default: .
pred_name	Column name for predictions.	String	N	Default: pred.
save_ids	Whether to save ids columns into output prediction table. If true, input feature_dataset must contain id columns, and receiver party must be id owner.	Boolean	N	Default: False.
save_label	Whether or not to save real label columns into output pred file. If true, input feature_dataset must contain label columns and receiver party must be label owner.	Boolean	N	Default: False.
offset_col	Specify a column to use as the offset	String	N	Default: .

Inputs#

Name	Description	Type(s)	Notes
model	Input model.	[‘sf.model.ss_glm’]
feature_dataset	Input vertical table.	[‘sf.table.vertical_table’]

Outputs#

Name	Description	Type(s)	Notes
pred	Output prediction.	[‘sf.table.individual’]

ss_sgd_predict#

Component version: 0.0.1

Predict using the SS-SGD model.

Attrs#

Name	Description	Type	Required	Notes
batch_size	The number of examples processed in one iteration.	Integer	N	Default: 1024. Range: (0, $\infty$).
receiver	Party of receiver.	String	Y	Default: .
pred_name	Column name for predictions.	String	N	Default: pred.
save_ids	Whether to save ids columns into output prediction table. If true, input feature_dataset must contain id columns, and receiver party must be id owner.	Boolean	N	Default: False.
save_label	Whether or not to save real label columns into output pred file. If true, input feature_dataset must contain label columns and receiver party must be label owner.	Boolean	N	Default: False.

Inputs#

Name	Description	Type(s)	Notes
model	Input model.	[‘sf.model.ss_sgd’]
feature_dataset	Input vertical table.	[‘sf.table.vertical_table’]

Outputs#

Name	Description	Type(s)	Notes
pred	Output prediction.	[‘sf.table.individual’]

ss_xgb_predict#

Component version: 0.0.1

Predict using the SS-XGB model.

Attrs#

Name	Description	Type	Required	Notes
receiver	Party of receiver.	String	Y	Default: .
pred_name	Column name for predictions.	String	N	Default: pred.
save_ids	Whether to save ids columns into output prediction table. If true, input feature_dataset must contain id columns, and receiver party must be id owner.	Boolean	N	Default: False.
save_label	Whether or not to save real label columns into output pred file. If true, input feature_dataset must contain label columns and receiver party must be label owner.	Boolean	N	Default: False.

Inputs#

Name	Description	Type(s)	Notes
model	Input model.	[‘sf.model.ss_xgb’]
feature_dataset	Input vertical table.	[‘sf.table.vertical_table’]

Outputs#

Name	Description	Type(s)	Notes
pred	Output prediction.	[‘sf.table.individual’]

ml.train#

sgb_train#

Component version: 0.0.1

Provides both classification and regression tree boosting (also known as GBDT, GBM) for vertically partitioned datasets by using SecureBoost.

SGB is short for SecureBoost. Compared to its safer counterpart SS-XGB, SecureBoost focuses on protecting the label holder.
Check https://arxiv.org/abs/1901.08755.

Attrs#

Name	Description	Type	Required	Notes
num_boost_round	Number of boosting iterations.	Integer	N	Default: 10. Range: [1, $\infty$).
max_depth	Maximum depth of a tree.	Integer	N	Default: 5. Range: [1, 16].
learning_rate	Step size shrinkage used in update to prevent overfitting.	Float	N	Default: 0.1. Range: (0.0, 1.0].
objective	Specify the learning objective.	String	N	Default: logistic. Allowed: [‘linear’, ‘logistic’].
reg_lambda	L2 regularization term on weights.	Float	N	Default: 0.1. Range: [0.0, 10000.0].
gamma	Greater than 0 means pre-pruning enabled. If gain of a node is less than this value, it would be pruned.	Float	N	Default: 0.1. Range: [0.0, 10000.0].
colsample_by_tree	Subsample ratio of columns when constructing each tree.	Float	N	Default: 1.0. Range: (0.0, 1.0].
sketch_eps	This roughly translates into O(1 / sketch_eps) number of bins.	Float	N	Default: 0.1. Range: (0.0, 1.0].
base_score	The initial prediction score of all instances, global bias.	Float	N	Default: 0.0. Range: [0.0, $\infty$).
seed	Pseudorandom number generator seed.	Integer	N	Default: 42. Range: [0, $\infty$).
fixed_point_parameter	Any floating-point number encoded by HEU is multiplied by a scale and rounded, where scale = 2 ** fixed_point_parameter. A larger value may give more numerical accuracy, but a value that is too large leads to overflow.	Integer	N	Default: 20. Range: [1, 100].
first_tree_with_label_holder_feature	Whether to train the first tree with label holder’s own features.	Boolean	N	Default: False.
batch_encoding_enabled	Whether to use batch encoding optimization.	Boolean	N	Default: True.
enable_quantization	Whether to enable quantization of g and h.	Boolean	N	Default: False.
quantization_scale	Scale the sum of g to the specified value.	Float	N	Default: 10000.0. Range: [0.0, 10000000.0].
max_leaf	Maximum leaf of a tree. Only effective if train leaf wise.	Integer	N	Default: 15. Range: [1, 32768].
rowsample_by_tree	Row sub sample ratio of the training instances.	Float	N	Default: 1.0. Range: (0.0, 1.0].
enable_goss	Whether to enable GOSS.	Boolean	N	Default: False.
top_rate	GOSS-specific parameter. The fraction of large gradients to sample.	Float	N	Default: 0.3. Range: (0.0, 1.0].
bottom_rate	GOSS-specific parameter. The fraction of small gradients to sample.	Float	N	Default: 0.5. Range: (0.0, 1.0].
early_stop_criterion_g_abs_sum	If sum(abs(g)) is lower than or equal to this threshold, training will stop.	Float	N	Default: 0.0. Range: [0.0, $\infty$).
early_stop_criterion_g_abs_sum_change_ratio	If absolute g sum change ratio is lower than or equal to this threshold, training will stop.	Float	N	Default: 0.0. Range: [0.0, 1.0].
tree_growing_method	How to grow tree?	String	N	Default: level.

Inputs#

Name	Description	Type(s)	Notes
train_dataset	Input vertical table.	[‘sf.table.vertical_table’]

Outputs#

Name	Description	Type(s)	Notes
output_model	Output model.	[‘sf.model.sgb’]
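
The fixed_point_parameter description can be made concrete with a small sketch of the encoding step it describes (multiply by 2 ** fixed_point_parameter, then round). The `encode` and `decode` helpers are hypothetical and only illustrate the arithmetic; they are not HEU functions, and Python integers do not overflow the way a bounded plaintext space would.

```python
def encode(x, fixed_point_parameter=20):
    """Map a float to an integer: round(x * 2 ** fixed_point_parameter)."""
    scale = 2 ** fixed_point_parameter
    return round(x * scale)


def decode(n, fixed_point_parameter=20):
    """Inverse mapping back to a float."""
    return n / 2 ** fixed_point_parameter


for bits in (4, 10, 20):
    restored = decode(encode(0.1, bits), bits)
    print(bits, restored, abs(restored - 0.1))
```

The rounding error is at most half of 1 / scale, so each extra bit halves the worst-case error while making the encoded integers larger.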

ss_glm_train#

Component version: 0.0.1

A generalized linear model (GLM) is a flexible generalization of ordinary linear regression. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.

Attrs#

Name	Description	Type	Required	Notes
epochs	The number of complete passes through the training data.	Integer	N	Default: 10. Range: [1, $\infty$).
learning_rate	The step size at each iteration.	Float	N	Default: 0.1. Range: (0.0, $\infty$).
batch_size	The number of training examples utilized in one iteration.	Integer	N	Default: 1024. Range: (0, $\infty$).
link_type	link function type	String	Y	Default: . Allowed: [‘Logit’, ‘Log’, ‘Reciprocal’, ‘Indentity’].
label_dist_type	label distribution type	String	Y	Default: . Allowed: [‘Bernoulli’, ‘Poisson’, ‘Gamma’, ‘Tweedie’].
tweedie_power	Tweedie distribution power parameter	Float	N	Default: 1.0. Range: [0.0, 2.0].
dist_scale	A guess value for distribution’s scale	Float	N	Default: 1.0. Range: [1.0, $\infty$).
eps	If the change rate of weights is less than this threshold, the model is considered to be converged, and the training stops early. 0 to disable.	Float	N	Default: 0.0001. Range: [0.0, $\infty$).
iter_start_irls	Run a few rounds of IRLS training to initialize w. 0 to disable.	Integer	N	Default: 0. Range: [0, $\infty$).
decay_epoch	decay learning interval	Integer	N	Default: 0. Range: [0, $\infty$).
decay_rate	decay learning rate	Float	N	Default: 0.0. Range: [0.0, 1.0).
optimizer	which optimizer to use: IRLS(Iteratively Reweighted Least Squares) or SGD(Stochastic Gradient Descent)	String	Y	Default: . Allowed: [‘SGD’, ‘IRLS’].
offset_col	Specify a column to use as the offset	String	N	Default: .
weight_col	Specify a column to use for the observation weights	String	N	Default: .

Inputs#

Name	Description	Type(s)	Notes
train_dataset	Input vertical table.	[‘sf.table.vertical_table’]

Outputs#

Name	Description	Type(s)	Notes
output_model	Output model.	[‘sf.model.ss_glm’]
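
The link_type options correspond to standard link functions g, where g(mean) equals the linear predictor. The sketch below lists each link with its inverse using their textbook definitions; the dictionary `LINKS` is a hypothetical illustration and not the component's internal representation. The key `Indentity` is spelled as it appears in the Allowed list.

```python
import math

# name -> (link g(mu), inverse link g^-1(eta))
LINKS = {
    "Logit": (lambda mu: math.log(mu / (1 - mu)), lambda eta: 1 / (1 + math.exp(-eta))),
    "Log": (lambda mu: math.log(mu), lambda eta: math.exp(eta)),
    "Reciprocal": (lambda mu: 1 / mu, lambda eta: 1 / eta),
    "Indentity": (lambda mu: mu, lambda eta: eta),
}

mu = 0.3
for name, (link, inverse) in LINKS.items():
    eta = link(mu)
    print(f"{name}: g({mu}) = {eta:.4f}, g^-1 = {inverse(eta):.4f}")
```

In common practice Logit is paired with Bernoulli labels and Log with Poisson, Gamma, or Tweedie labels, though the table does not restrict the combinations.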

ss_sgd_train#

Component version: 0.0.1

Train linear or logistic regression models on vertically partitioned datasets with a mini-batch SGD solver by using secret sharing.

SS-SGD is short for secret sharing SGD training.

Attrs#

Name	Description	Type	Required	Notes
epochs	The number of complete passes through the training data.	Integer	N	Default: 10. Range: [1, $\infty$).
learning_rate	The step size at each iteration.	Float	N	Default: 0.1. Range: (0.0, $\infty$).
batch_size	The number of training examples utilized in one iteration.	Integer	N	Default: 1024. Range: (0, $\infty$).
sig_type	Sigmoid approximation type.	String	N	Default: t1. Allowed: [‘real’, ‘t1’, ‘t3’, ‘t5’, ‘df’, ‘sr’, ‘mix’].
reg_type	Regression type	String	N	Default: logistic. Allowed: [‘linear’, ‘logistic’].
penalty	The penalty(aka regularization term) to be used.	String	N	Default: None. Allowed: [‘None’, ‘l1’, ‘l2’].
l2_norm	L2 regularization term.	Float	N	Default: 0.5. Range: [0.0, $\infty$).
eps	If the change rate of weights is less than this threshold, the model is considered to be converged, and the training stops early. 0 to disable.	Float	N	Default: 0.001. Range: [0.0, $\infty$).

Inputs#

Name	Description	Type(s)	Notes
train_dataset	Input vertical table.	[‘sf.table.vertical_table’]

Outputs#

Name	Description	Type(s)	Notes
output_model	Output model.	[‘sf.model.ss_sgd’]
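
Sigmoid approximations matter under secret sharing because the exact function is expensive to evaluate there. The sketch below compares the exact sigmoid with Taylor expansions around 0, assuming that the sig_type values t1, t3, and t5 denote expansions of order 1, 3, and 5; the table does not state this, so treat the mapping as an assumption. The function names are hypothetical.

```python
import math


def sigmoid_real(x):
    return 1 / (1 + math.exp(-x))


def sigmoid_t1(x):
    return 0.5 + 0.25 * x


def sigmoid_t3(x):
    return 0.5 + 0.25 * x - x ** 3 / 48


def sigmoid_t5(x):
    return 0.5 + 0.25 * x - x ** 3 / 48 + x ** 5 / 480


for x in (0.5, 1.0, 3.0):
    exact = sigmoid_real(x)
    errors = [abs(f(x) - exact) for f in (sigmoid_t1, sigmoid_t3, sigmoid_t5)]
    print(x, [f"{e:.2e}" for e in errors])
```

The expansions are accurate near 0 and degrade quickly for large inputs, which is one reason feature scaling matters when a polynomial approximation is used.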

ss_xgb_train#

Component version: 0.0.1

This method provides both classification and regression tree boosting (also known as GBDT, GBM) for vertically partitioned datasets by using secret sharing.

SS-XGB is short for secret sharing XGB.
More details: https://arxiv.org/pdf/2005.08479.pdf

Attrs#

Name	Description	Type	Required	Notes
num_boost_round	Number of boosting iterations.	Integer	N	Default: 10. Range: [1, $\infty$).
max_depth	Maximum depth of a tree.	Integer	N	Default: 5. Range: [1, 16].
learning_rate	Step size shrinkage used in updates to prevent overfitting.	Float	N	Default: 0.1. Range: (0.0, 1.0].
objective	Specify the learning objective.	String	N	Default: logistic. Allowed: [‘linear’, ‘logistic’].
reg_lambda	L2 regularization term on weights.	Float	N	Default: 0.1. Range: [0.0, 10000.0].
subsample	Subsample ratio of the training instances.	Float	N	Default: 0.1. Range: (0.0, 1.0].
colsample_by_tree	Subsample ratio of columns when constructing each tree.	Float	N	Default: 0.1. Range: (0.0, 1.0].
sketch_eps	This roughly translates into O(1 / sketch_eps) number of bins.	Float	N	Default: 0.1. Range: (0.0, 1.0].
base_score	The initial prediction score of all instances, global bias.	Float	N	Default: 0.0. Range: [0.0, $\infty$).
seed	Pseudorandom number generator seed.	Integer	N	Default: 42. Range: [0, $\infty$).

Inputs#

Name	Description	Type(s)	Notes
train_dataset	Input vertical table.	[‘sf.table.vertical_table’]

Outputs#

Name	Description	Type(s)	Notes
output_model	Output model.	[‘sf.model.ss_xgb’]

preprocessing#

feature_filter#

Component version: 0.0.1

Drop features from the dataset.

Inputs#

Name	Description	Type(s)	Notes
in_ds	Input vertical table.	[‘sf.table.vertical_table’]	Extra table attributes.(0) drop_features - Features to drop.

Outputs#

Name	Description	Type(s)	Notes
out_ds	Output vertical table.	[‘sf.table.vertical_table’]

psi#

Component version: 0.0.1

Private set intersection (PSI) between two parties.

Attrs#

Name	Description	Type	Required	Notes
protocol	PSI protocol.	String	N	Default: ECDH_PSI_2PC. Allowed: [‘ECDH_PSI_2PC’, ‘KKRT_PSI_2PC’, ‘BC22_PSI_2PC’].
sort	Sort the output.	Boolean	N	Default: False.
bucket_size	Specify the hash bucket size used in PSI. Larger values consume more memory.	Integer	N	Default: 1048576. Range: (0, $\infty$).
ecdh_curve_type	Curve type for ECDH PSI.	String	N	Default: CURVE_FOURQ. Allowed: [‘CURVE_25519’, ‘CURVE_FOURQ’, ‘CURVE_SM2’, ‘CURVE_SECP256K1’].

Inputs#

Name	Description	Type(s)	Notes
receiver_input	Individual table for receiver	[‘sf.table.individual’]	Extra table attributes.(0) key - Column(s) used to join. If not provided, ids of the dataset will be used.
sender_input	Individual table for sender	[‘sf.table.individual’]	Extra table attributes.(0) key - Column(s) used to join. If not provided, ids of the dataset will be used.

Outputs#

Name	Description	Type(s)	Notes
psi_output	Output vertical table	[‘sf.table.vertical_table’]
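
To show what the output represents, the sketch below computes the same join in plaintext. It is not a PSI protocol: a real protocol such as ECDH-PSI arrives at the intersection without revealing non-matching keys to the other party. The function `plaintext_intersection` and the sample column names are hypothetical.

```python
def plaintext_intersection(receiver_rows, sender_rows, key):
    """Join two lists of row dicts on `key`, keeping only keys present in both."""
    sender_by_key = {row[key]: row for row in sender_rows}
    joined = []
    for row in receiver_rows:
        match = sender_by_key.get(row[key])
        if match is not None:
            # In a vertical table each party keeps its own columns;
            # they are merged here only to display the aligned rows.
            joined.append({**row, **match})
    return joined


receiver = [{"id": "a", "age": 30}, {"id": "b", "age": 41}, {"id": "c", "age": 25}]
sender = [{"id": "b", "income": 10}, {"id": "c", "income": 20}, {"id": "d", "income": 30}]
for row in plaintext_intersection(receiver, sender, key="id"):
    print(row)
```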

train_test_split#

Component version: 0.0.1

Split datasets into random train and test subsets.

Please check: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

Attrs#

Name	Description	Type	Required	Notes
train_size	Proportion of the dataset to include in the train subset.	Float	N	Default: 0.75. Range: [0.0, 1.0].
test_size	Proportion of the dataset to include in the test subset.	Float	N	Default: 0.25. Range: [0.0, 1.0].
random_state	Specify the random seed of the shuffling.	Integer	N	Default: 1024. Range: (0, $\infty$).
shuffle	Whether to shuffle the data before splitting.	Boolean	N	Default: True.

Inputs#

Name	Description	Type(s)	Notes
input_data	Input vertical table.	[‘sf.table.vertical_table’]

Outputs#

Name	Description	Type(s)	Notes
train	Output train dataset.	[‘sf.table.vertical_table’]
test	Output test dataset.	[‘sf.table.vertical_table’]
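
The attributes above map onto a shuffle-then-slice procedure. The sketch below is a simplified standard-library illustration with a hypothetical `split_indices` helper; it takes the test subset as the complement of the train subset and uses simple truncation, so its rounding may differ from scikit-learn's. Applying one shared index split to every party's columns is assumed here as the way rows stay aligned in a vertical table.

```python
import random


def split_indices(n_rows, train_size=0.75, random_state=1024, shuffle=True):
    """Return (train_indices, test_indices) for a table with n_rows rows."""
    indices = list(range(n_rows))
    if shuffle:
        # A fixed seed makes the shuffle reproducible across runs.
        random.Random(random_state).shuffle(indices)
    n_train = int(n_rows * train_size)
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = split_indices(8)
print("train:", train_idx)
print("test:", test_idx)
```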

stats#

ss_pearsonr#

Component version: 0.0.1

Calculate Pearson’s product-moment correlation coefficient for vertical partitioning dataset by using secret sharing.

For large datasets (more than 100,000 samples and 200 features), the [Ring size: 128, Fxp: 40] options are recommended for the SPU device.

Inputs#

Name	Description	Type(s)	Notes
input_data	Input vertical table.	[‘sf.table.vertical_table’]	Extra table attributes.(0) feature_selects - Specify which features to calculate correlation coefficient with. If empty, all features will be used.

Outputs#

Name	Description	Type(s)	Notes
report	Output Pearson’s product-moment correlation coefficient report.	[‘sf.report’]
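
For reference, the coefficient reported for each pair of features is the standard Pearson formula, shown here in plaintext. This is only a sketch with a hypothetical `pearson_r` helper; the component evaluates the same quantity on secret-shared data.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))
print(pearson_r([1, 2, 3, 4], [1, 3, 2, 4]))
```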

ss_vif#

Component version: 0.0.1

Calculate Variance Inflation Factor(VIF) for vertical partitioning dataset by using secret sharing.

For large datasets (more than 100,000 samples and 200 features), the [Ring size: 128, Fxp: 40] options are recommended for the SPU device.

Inputs#

Name	Description	Type(s)	Notes
input_data	Input vertical table.	[‘sf.table.vertical_table’]	Extra table attributes.(0) feature_selects - Specify which features to calculate VIF with. If empty, all features will be used.

Outputs#

Name	Description	Type(s)	Notes
report	Output Variance Inflation Factor(VIF) report.	[‘sf.report’]

table_statistics#

Component version: 0.0.1

Get a table of statistics, including each column’s:

datatype
total_count
count
count_na
min
max
var
std
sem
skewness
kurtosis
q1
q2
q3
moment_2
moment_3
moment_4
central_moment_2
central_moment_3
central_moment_4
sum
sum_2
sum_3
sum_4

moment_2 means E[X^2].
central_moment_2 means E[(X - mean(X))^2].
sum_2 means sum(X^2).
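
The three definitions above generalize to the other orders in the list. The following is a minimal plaintext sketch with hypothetical helper names, where k is the order suffix (2, 3, or 4).

```python
def moment(xs, k):
    """moment_k: E[X^k]."""
    return sum(x ** k for x in xs) / len(xs)


def central_moment(xs, k):
    """central_moment_k: E[(X - mean(X))^k]."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** k for x in xs) / len(xs)


def power_sum(xs, k):
    """sum_k: sum(X^k)."""
    return sum(x ** k for x in xs)


xs = [1, 2, 3, 4]
print(moment(xs, 2), central_moment(xs, 2), power_sum(xs, 2))
```

For order 2 these are linked by central_moment_2 = moment_2 - mean(X)^2, and moment_2 = sum_2 / count.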

Inputs#

Name	Description	Type(s)	Notes
input_data	Input table.	[‘sf.table.vertical_table’, ‘sf.table.individual’]

Outputs#

Name	Description	Type(s)	Notes
report	Output table statistics report.	[‘sf.report’]