secretflow.preprocessing.binning package#

Subpackages#

Submodules#

secretflow.preprocessing.binning.homo_binning module#

Driver-side program.

Classes:

HomoBinning([bin_num, compress_thres, ...])

Entry point for federated binning.

class secretflow.preprocessing.binning.homo_binning.HomoBinning(bin_num: int = 10, compress_thres: int = 10000, head_size: int = 10000, error: float = 0.0001, bin_indexes: List[int] = [], bin_names: List[str] = [], abnormal_list: Optional[List[str]] = None, allow_duplicate: bool = False, max_iter: int = 10, aggregator=None)[source]#

Bases: ActorProxy(HomoBinningBase)

Entry point for federated binning.

bin_num#

number of buckets to split the data into

compress_thres#

compression threshold; compression is performed when the value exceeds this threshold

head_size#

buffer size

error#

error tolerance

bin_indexes#

indexes of the features to bin

bin_names#

names of the features to bin

abnormal_list#

list of anomaly features

allow_duplicate#

whether to allow duplicate bucket values

aggregator#

aggregator used to aggregate values

max_iter#

maximum number of iteration rounds

Methods:

__init__([bin_num, compress_thres, ...])

Abstraction device object base class.

fit_split_points(hdata)

Entry point for federated binning.

setup_header_param(header, bin_names, ...)

__init__(bin_num: int = 10, compress_thres: int = 10000, head_size: int = 10000, error: float = 0.0001, bin_indexes: List[int] = [], bin_names: List[str] = [], abnormal_list: Optional[List[str]] = None, allow_duplicate: bool = False, max_iter: int = 10, aggregator=None)[source]#

Abstraction device object base class.

Parameters

device (Device) – Device where this object is located.

fit_split_points(hdata: HDataFrame)[source]#

Entry point for federated binning.

Parameters

hdata – HDataFrame, the input data to bin

Returns

a dict of binning results (PYUObject)

Return type

bin_result

setup_header_param(header: List[str], bin_names: List[str], bin_indexes: List[int])[source]#

secretflow.preprocessing.binning.homo_binning_base module#

Classes:

SplitPointNode(value, min_value, max_value)

Dataclass of split point node

HomoBinningBase

alias of ActorProxy(HomoBinningBase)

class secretflow.preprocessing.binning.homo_binning_base.SplitPointNode(value: float, min_value: float, max_value: float, aim_rank: int = -1, allow_error_rank: int = 0, error: float = 0.0001, fixed: bool = False)[source]#

Bases: object

Dataclass of split point node

value#

value of the split point

Type

float

min_value#

min value of the split point

Type

float

max_value#

max value of the split point

Type

float

aim_rank#

aim rank of the split point

Type

int

allow_error_rank#

error tolerance on ranks

Type

int

error#

create a new node if the difference is greater than error

Type

float

fixed#

whether the split position converges

Type

bool

Attributes:

value

min_value

max_value

aim_rank

allow_error_rank

error

fixed

Methods:

create_right_new()

Search the right half

create_left_new()

Search the left half

__init__(value, min_value, max_value[, ...])

value: float#
min_value: float#
max_value: float#
aim_rank: int = -1#
allow_error_rank: int = 0#
error: float = 0.0001#
fixed: bool = False#
create_right_new()[source]#

Search the right half

create_left_new()[source]#

Search the left half
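The two methods above describe a bisection step over the interval [min_value, max_value]. The following is a minimal sketch of that idea, not the library's implementation: `SketchNode` and its `_narrow` helper are hypothetical names, and the convergence rule (stop when the new midpoint is within `error` of the current value) is an assumption based on the description of the `error` attribute.

```python
from dataclasses import dataclass


@dataclass
class SketchNode:
    """Hypothetical stand-in for SplitPointNode; not the library class."""

    value: float
    min_value: float
    max_value: float
    aim_rank: int = -1
    allow_error_rank: int = 0
    error: float = 1e-4
    fixed: bool = False

    def create_right_new(self) -> "SketchNode":
        # The target rank lies above `value`: search [value, max_value].
        return self._narrow(self.value, self.max_value)

    def create_left_new(self) -> "SketchNode":
        # The target rank lies below `value`: search [min_value, value].
        return self._narrow(self.min_value, self.value)

    def _narrow(self, low: float, high: float) -> "SketchNode":
        midpoint = (low + high) / 2
        if abs(midpoint - self.value) <= self.error:
            # Assumed rule: the interval is too small to split, so the
            # node is marked as converged and returned unchanged.
            self.fixed = True
            return self
        return SketchNode(midpoint, low, high, self.aim_rank,
                          self.allow_error_rank, self.error)


node = SketchNode(value=50.0, min_value=0.0, max_value=100.0, aim_rank=30)
left = node.create_left_new()    # midpoint of [0, 50]
right = left.create_right_new()  # midpoint of [25, 50]
tiny = SketchNode(value=1.0, min_value=1.0, max_value=1.00001)
same = tiny.create_right_new()   # interval below `error`: node is fixed
```

Each call halves the search interval, so the number of rounds needed grows with the logarithm of the value range divided by `error`.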

__init__(value: float, min_value: float, max_value: float, aim_rank: int = -1, allow_error_rank: int = 0, error: float = 0.0001, fixed: bool = False) None#
secretflow.preprocessing.binning.homo_binning_base.HomoBinningBase[source]#

alias of ActorProxy(HomoBinningBase)

Methods:

__init__(*args, **kwargs)

Abstraction device object base class.

get_missing_count(*[, _ray_trace_ctx])

Collect the missing-value counts of all parties.

set_missing_dict(missing_count, *[, ...])

cal_summary_dict(data, *[, _ray_trace_ctx])

init_query_points(split_num[, error_rank, ...])

Initialize the query points.

fit_split_points(data, *[, _ray_trace_ctx])

query_values(*[, _ray_trace_ctx])

Query the global rank of each current partition point. Returns global_rank, a dict, e.g. {col1: [g_rank1], col2: [g_rank2]}.

query_table(summary, query_points, *[, ...])

Query the rank of query_points in the local summary

set_aim_rank(*[, _ray_trace_ctx])

set_header_param(bin_names, bin_indexes, ...)

get_split_points_dict(*[, _ray_trace_ctx])

renew_query_points(global_ranks, *[, ...])

Update the query points.

check_converge(*[, _ray_trace_ctx])

Check the convergence of federated binning.

get_bin_result(*[, _ray_trace_ctx])
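The method summaries above (query_table, query_values, renew_query_points, check_converge) suggest an iterative loop: each party ranks the current query point locally, the local ranks are summed into a global rank, and the query point moves left or right until the global rank is close enough to the target. The sketch below illustrates that loop in plain Python under strong simplifying assumptions: every party holds exact sorted values rather than a compressed summary, a plain `sum` stands in for the secure aggregator, and `local_rank`, `global_rank` and `find_split_point` are hypothetical helper names.

```python
from bisect import bisect_right
from typing import List


def local_rank(sorted_values: List[float], query_point: float) -> int:
    """Number of local values less than or equal to the query point."""
    return bisect_right(sorted_values, query_point)


def global_rank(parties: List[List[float]], query_point: float) -> int:
    """Sum of the local ranks; a plain sum stands in for the aggregator."""
    return sum(local_rank(values, query_point) for values in parties)


def find_split_point(parties: List[List[float]], aim_rank: int,
                     allow_error_rank: int = 0, max_iter: int = 50) -> float:
    """Bisect on the value range until the global rank hits the target."""
    low = min(values[0] for values in parties)
    high = max(values[-1] for values in parties)
    point = (low + high) / 2
    for _ in range(max_iter):
        rank = global_rank(parties, point)
        if abs(rank - aim_rank) <= allow_error_rank:
            break  # converged: rank is within the allowed error
        if rank < aim_rank:
            low = point   # search the right half
        else:
            high = point  # search the left half
        point = (low + high) / 2
    return point


# Two parties, each holding its own sorted feature values.
alice = sorted([1.0, 4.0, 7.0, 10.0])
bob = sorted([2.0, 3.0, 8.0, 9.0])
median = find_split_point([alice, bob], aim_rank=4)
quartile = find_split_point([alice, bob], aim_rank=2)
```

Only ranks cross the party boundary in this sketch; the raw values stay local, which is the property the federated procedure relies on.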

secretflow.preprocessing.binning.vert_woe_binning module#

Classes:

VertWoeBinning(secure_device)

woe binning for vertical slice datasets.

class secretflow.preprocessing.binning.vert_woe_binning.VertWoeBinning(secure_device: Union[SPU, HEU])[source]#

Bases: object

woe binning for vertical slice datasets.

Splits all features into bins by equal frequency or ChiMerge, then calculates the WOE and IV values of each bin on a secret-sharing (SS) or homomorphic-encryption (HE) secure device to protect the label Y.

Finally, it outputs the binning rules that VertWOESubstitution uses to replace feature values with WOE values.

More details about WOE and IV values: https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html
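As a plaintext illustration of the quantities involved, the sketch below computes WOE and IV per bin from positive and negative sample counts. It is not the secure computation performed by this class. The helper name `woe_iv` is hypothetical, and the sign convention (positive share over negative share) is an assumption; some references use the inverse ratio, which flips the sign of WOE but leaves IV unchanged. Every bin is assumed to contain at least one positive and one negative sample.

```python
import math
from typing import List, Tuple


def woe_iv(pos_counts: List[int],
           neg_counts: List[int]) -> Tuple[List[float], List[float]]:
    """Return per-bin WOE and IV lists from per-bin label counts."""
    total_pos = sum(pos_counts)
    total_neg = sum(neg_counts)
    woes, ivs = [], []
    for pos, neg in zip(pos_counts, neg_counts):
        pos_rate = pos / total_pos  # share of all positives in this bin
        neg_rate = neg / total_neg  # share of all negatives in this bin
        # Assumed convention: WOE = ln(positive share / negative share).
        woe = math.log(pos_rate / neg_rate)
        woes.append(woe)
        ivs.append((pos_rate - neg_rate) * woe)
    return woes, ivs


# Two bins: 10 positives / 40 negatives, and 30 positives / 20 negatives.
woes, ivs = woe_iv([10, 30], [40, 20])
total_iv = sum(ivs)
```

Each IV term is the product of two factors with the same sign, so per-bin IV values are never negative and the total IV can be compared across features.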

secure_device#

HEU or SPU for secure bucket summation.

Methods:

__init__(secure_device)

binning(vdata[, binning_method, bin_num, ...])

Build WOE substitution rules based on vdata.

__init__(secure_device: Union[SPU, HEU])[source]#
binning(vdata: VDataFrame, binning_method: str = 'quantile', bin_num: int = 10, bin_names: Dict[PYU, List[str]] = {}, label_name: str = '', positive_label: str = '1', chimerge_init_bins: int = 100, chimerge_target_bins: int = 10, chimerge_target_pvalue: float = 0.1, audit_log_path: Dict[str, str] = {})[source]#

Build WOE substitution rules based on vdata. Only datasets with a binary classification label are supported.

vdata#

vertical slice dataset. Numeric features are binned with binning_method; string features are binned by category. The else bin counts the np.nan samples.

binning_method#

how to bin numeric features. Options: "quantile" (equal frequency) or "chimerge" (ChiMerge from AAAI92-019). Default: "quantile"

bin_num#

maximum number of bins for one feature. Range: (0, ∞). Default: 10

bin_names#

which features should be binned.

label_name#

label column name.

positive_label#

which value represents the positive class in the label.

chimerge_init_bins#

maximum number of bins for the initial binning in ChiMerge. Range: (2, ∞). Default: 100

chimerge_target_bins#

stop merging once the number of remaining bins is less than or equal to this value. Range: [2, chimerge_init_bins). Default: 10

chimerge_target_pvalue#

stop merging if the largest p-value among the remaining bins is greater than this value. Range: (0, 1). Default: 0.1

audit_log_path#

output the audit log for HEU encryption to each device's local path; empty means disabled. Example: {'alice': '/path/to/alice/audit/filename', 'bob': 'bob/audit/filename'}. NOTICE: do NOT touch this option; leave it empty and disabled unless you really understand what it does and accept its risk.

Returns

Dict[PYU, PYUObject], where each PYUObject contains a dict with the rules for all features of that party.

{
    "variables":[
        {
            "name": str, # feature name
            "type": str, # "string" for discrete features, "numeric" for continuous ones
            "categories": list[str], # categories of a discrete feature
            "split_points": list[float], # left-open, right-closed split points
            "total_counts": list[int], # total sample count in each bin
            "else_counts": int, # count of np.nan samples
            "woes": list[float], # WOE value of each bin
            "else_woe": float, # WOE value for np.nan samples
            "ivs": list[float], # IV value of each bin
            "else_iv": float, # IV value for np.nan samples
        },
        # ... other features
    ]
}
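To make the rule format concrete, the sketch below looks up the WOE value of a single raw value in one entry of "variables". It is a plaintext illustration, not the code VertWOESubstitution runs. `substitute_value` is a hypothetical helper, and three points are assumptions: that "woes" has one more entry than "split_points" (the last bin being open-ended), that "woes" is aligned index by index with "categories" for string features, and that an unseen category falls into the else bin.

```python
import math
from bisect import bisect_left
from typing import Any, Dict


def substitute_value(rule: Dict[str, Any], value: Any) -> float:
    """Map one raw feature value to its WOE value under a single rule."""
    # Missing values (None or NaN) go to the else bin.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return rule["else_woe"]
    if rule["type"] == "string":
        if value in rule["categories"]:
            return rule["woes"][rule["categories"].index(value)]
        return rule["else_woe"]  # assumption: unseen category -> else bin
    # Left-open, right-closed bins: a value equal to a split point belongs
    # to the bin that ends at that split point, which bisect_left gives.
    bin_index = bisect_left(rule["split_points"], value)
    return rule["woes"][bin_index]


# Illustrative rules with made-up numbers.
age_rule = {
    "name": "age", "type": "numeric",
    "split_points": [30.0, 50.0],
    "woes": [-0.5, 0.1, 0.7], "else_woe": 0.0,
}
city_rule = {
    "name": "city", "type": "string",
    "categories": ["a", "b"],
    "woes": [0.2, -0.3], "else_woe": 0.05,
}

age_woe = substitute_value(age_rule, 30.0)   # falls in (-inf, 30.0]
city_woe = substitute_value(city_rule, "b")
```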

secretflow.preprocessing.binning.vert_woe_binning_pyu module#

Classes:

VertWoeBinningPyuWorker

alias of ActorProxy(VertWoeBinningPyuWorker)

secretflow.preprocessing.binning.vert_woe_binning_pyu.VertWoeBinningPyuWorker[source]#

alias of ActorProxy(VertWoeBinningPyuWorker)

Methods:

__init__(*args, **kwargs)

Abstraction device object base class.

master_work(data, *[, _ray_trace_ctx])

The label holder builds the report for its own features and provides the label to the driver.

slave_build_sum_select(data, *[, _ray_trace_ctx])

Build the selection matrix the driver uses to calculate positive samples with secret sharing.

slave_build_sum_indices(data, *[, ...])

Return the sample indices of each bin so the driver can calculate positive samples with homomorphic encryption.

slave_sum_bin(bins_positive, *[, _ray_trace_ctx])

Build the bin statistics tuple.

master_calc_woe_for_peer(bins_stat, *[, ...])

Calculate WOE/IV for the slave party.

slave_build_report(woe_ivs, *[, _ray_trace_ctx])

Build the report from the master party's WOE/IV values.

secretflow.preprocessing.binning.vert_woe_substitution module#

Classes:

VertWOESubstitutionPyuWorker

alias of ActorProxy(VertWOESubstitutionPyuWorker)

VertWOESubstitution()

secretflow.preprocessing.binning.vert_woe_substitution.VertWOESubstitutionPyuWorker[source]#

alias of ActorProxy(VertWOESubstitutionPyuWorker)

Methods:

sub(data, r, *[, _ray_trace_ctx])

PYU functions for woe substitution.

__init__(*args, **kwargs)

Abstraction device object base class.

class secretflow.preprocessing.binning.vert_woe_substitution.VertWOESubstitution[source]#

Bases: object

Methods:

substitution(vdata, woe_rules)

Substitute the dataset's values according to the WOE substitution rules.

substitution(vdata: VDataFrame, woe_rules: Dict[PYU, PYUObject]) VDataFrame[source]#

Substitute the dataset's values according to the WOE substitution rules.

Parameters
  • vdata – vertical slice dataset to be substituted.

  • woe_rules – WOE substitution rules built by VertWoeBinning.

Returns

vertical slice dataset after substitution.

Return type

new_vdata

Module contents#