secretflow.preprocessing.binning.kernels package#

Submodules#

secretflow.preprocessing.binning.kernels.base_binning module#

Classes:

BaseBinning(bin_names, bin_indexes, bin_num, ...)

class secretflow.preprocessing.binning.kernels.base_binning.BaseBinning(bin_names: List, bin_indexes: List, bin_num: int, abnormal_list: List)[source]#

Bases: ABC

Methods:

__init__(bin_names, bin_indexes, bin_num, ...)

fit_split_points(data)

Attributes:

split_points

__init__(bin_names: List, bin_indexes: List, bin_num: int, abnormal_list: List)[source]#
property split_points#
abstract fit_split_points(data)[source]#

secretflow.preprocessing.binning.kernels.quantile_binning module#

Classes:

QuantileBinning([bin_num, compress_thres, ...])

Use the QuantileSummary algorithm for equal-frequency binning

class secretflow.preprocessing.binning.kernels.quantile_binning.QuantileBinning(bin_num: int = 10, compress_thres: int = 10000, head_size: int = 10000, error: float = 0.0001, bin_indexes: List[int] = [], bin_names: List[str] = [], local_only: bool = False, abnormal_list: Optional[List[str]] = None, allow_duplicate: bool = False)[source]#

Bases: BaseBinning

Use the QuantileSummary algorithm for equal-frequency binning

bin_num#

the number of buckets

compress_thres#

if the size of the summary is greater than compress_thres, a compress operation is performed

cols_dict#

mapping of column name to index: {key: col_name, value: index}

head_size#

buffer size

error#

0 <= error < 1, default: 0.0001. Error tolerance: floor((p - 2 * error) * N) <= rank(x) <= ceil((p + 2 * error) * N)

abnormal_list#

list of abnormal features; these do not participate in binning.

summary_dict#

a dict storing the summary of each feature

col_name_maps#

a dict mapping column index to column name

bin_idx_name#

a dict mapping bin index to name

allow_duplicate#

whether duplicate split points are allowed

Methods:

__init__([bin_num, compress_thres, ...])

fit_split_points(data_frame)

calculate bin split points based on the QuantileSummary algorithm

feature_summary(data_frame, compress_thres, ...)

calculate a quantile summary for each feature

__init__(bin_num: int = 10, compress_thres: int = 10000, head_size: int = 10000, error: float = 0.0001, bin_indexes: List[int] = [], bin_names: List[str] = [], local_only: bool = False, abnormal_list: Optional[List[str]] = None, allow_duplicate: bool = False)[source]#
fit_split_points(data_frame: DataFrame) DataFrame[source]#

calculate bin split points based on the QuantileSummary algorithm

Parameters

data_frame – input data

Returns

bin result returned as dataframe

Return type

bin_result
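To make the contract of fit_split_points concrete, the sketch below approximates what equal-frequency binning computes for each column, using exact pandas quantiles. The function name equal_frequency_split_points is hypothetical; the real QuantileBinning uses the approximate QuantileSummary algorithm with an error tolerance instead of exact quantiles.

```python
import numpy as np
import pandas as pd

def equal_frequency_split_points(df: pd.DataFrame, bin_num: int = 10) -> pd.DataFrame:
    # Approximation of the fit_split_points contract: for each column,
    # return the interior quantile cut points that divide the data into
    # bin_num equally populated buckets (bin_num - 1 split points).
    quantiles = np.linspace(0, 1, bin_num + 1)[1:-1]
    return pd.DataFrame(
        {col: df[col].quantile(quantiles).to_numpy() for col in df.columns}
    )

df = pd.DataFrame({"x": list(range(100))})
split_points = equal_frequency_split_points(df, bin_num=4)  # 3 cut points per column
```

Note that each bucket here holds the same number of samples by construction, which is what "equal-frequency" means; the approximate version trades exactness of these cut points for a single pass over the data.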

static feature_summary(data_frame: DataFrame, compress_thres: int, head_size: int, error: float, bin_dict: Dict[str, int], abnormal_list: List[str]) Dict[source]#

calculate a quantile summary for each feature

Parameters
  • data_frame – pandas.DataFrame, input data

  • compress_thres – int, compress the summary when its size exceeds this threshold

  • head_size – int, buffer size; when the buffer reaches head_size, the summary is created

  • error – float, error tolerance

  • bin_dict – a dict mapping column name to index

  • abnormal_list – list of abnormal features
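The per-column dispatch of feature_summary can be sketched as follows. This is a simplification under stated assumptions: the "summary" here is just the sorted column values, whereas the real static method builds a QuantileSummaries object per column; the row-of-dicts input shape is illustrative only.

```python
from typing import Dict, List

def feature_summary(rows: List[dict], bin_dict: Dict[str, int],
                    abnormal_list: List[str]) -> Dict[str, list]:
    # Build one summary per binnable column, keyed by column name,
    # skipping columns listed as abnormal features.
    summaries = {}
    for col in bin_dict:
        if col in abnormal_list:
            continue
        summaries[col] = sorted(row[col] for row in rows)
    return summaries
```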

secretflow.preprocessing.binning.kernels.quantile_summaries module#

Classes:

Stats(value, w, delta)

store information for each item in the summary

QuantileSummaries([compress_thres, ...])

QuantileSummary

class secretflow.preprocessing.binning.kernels.quantile_summaries.Stats(value: float, w: int, delta: int)[source]#

Bases: object

store information for each item in the summary

value#

value of this stat

Type

float

w#

weight of this stat

Type

int

delta#

delta = rmax - rmin

Type

int

Attributes:

value

w

delta

Methods:

__init__(value, w, delta)

value: float#
w: int#
delta: int#
__init__(value: float, w: int, delta: int) None#
class secretflow.preprocessing.binning.kernels.quantile_summaries.QuantileSummaries(compress_thres: int = 10000, head_size: int = 10000, error: float = 0.0001, abnormal_list: Optional[List] = None)[source]#

Bases: object

QuantileSummary

insert: insert data into the summary
merge: merge two summaries
fast_init: a fast implementation that creates the summary with little performance loss
compress: compress the summary down to a bounded size

compress_thres#

if the number of stats is greater than compress_thres, compress is performed

head_size#

buffer size for inserted data; when the number of buffered samples reaches head_size, the summary is created

error#

0 <= error < 1, default: 0.0001. Error tolerance for binning: floor((p - 2 * error) * N) <= rank(x) <= ceil((p + 2 * error) * N)

abnormal_list#

List of abnormal features; they will not participate in binning

Methods:

__init__([compress_thres, head_size, error, ...])

fast_init(col_data)

compress()

compress the summary so that its size stays under compress_thres

query(quantile)

Query the value at the specified quantile

value_to_rank(value)

batch_query_value(values)

batch query function

__init__(compress_thres: int = 10000, head_size: int = 10000, error: float = 0.0001, abnormal_list: Optional[List] = None)[source]#
fast_init(col_data: ndarray)[source]#
compress()[source]#

compress the summary so that its size stays under compress_thres
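A Greenwald-Khanna-style compression step can be sketched as below. This is a simplified illustration, not the library's implementation: it absorbs a tuple into its right neighbour while the combined rank uncertainty stays within floor(2 * error * N), and omits the rmin/rmax bookkeeping and delta updates a full summary maintains.

```python
import math
from dataclasses import dataclass
from typing import List

@dataclass
class Stats:
    value: float
    w: int      # weight: number of samples this tuple represents
    delta: int  # rank uncertainty, rmax - rmin

def compress(stats: List[Stats], error: float, n: int) -> List[Stats]:
    # Merge a tuple into its right neighbour while the combined
    # uncertainty w_i + w_{i+1} + delta_{i+1} stays within the bound.
    threshold = math.floor(2 * error * n)
    out = []
    head = stats[-1]
    for s in reversed(stats[:-1]):
        if s.w + head.w + head.delta <= threshold:
            head = Stats(head.value, head.w + s.w, head.delta)
        else:
            out.append(head)
            head = s
    out.append(head)
    out.reverse()
    return out
```

Compression preserves the total weight (so rank queries remain answerable) while shrinking the number of stored tuples; the error bound limits how much rank precision each merge can lose.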

query(quantile: float) float[source]#

Query the value at the specified quantile

Parameters

quantile – float [0.0, 1.0]

Returns

float, the value of the quantile location
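The semantics of query can be illustrated on raw, uncompressed samples. The sketch below returns the element whose rank matches the requested quantile exactly; the real summary answers the same question on its compressed tuples, within the 2 * error rank tolerance documented above.

```python
import math
from typing import List

def query(sorted_samples: List[float], quantile: float) -> float:
    # Return the sample whose rank corresponds to quantile * N,
    # clamped to the valid index range for quantile 0.0 and 1.0.
    n = len(sorted_samples)
    rank = max(0, min(n - 1, math.ceil(quantile * n) - 1))
    return sorted_samples[rank]
```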

value_to_rank(value: Union[float, int]) int[source]#
batch_query_value(values: List[float]) List[int][source]#

batch query function

Parameters

values – sorted list of values, e.g. [13, 56, 79]

Returns

the rank of each queried value

Return type

List
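Assuming the rank of a value is the number of samples less than or equal to it (an assumption; the real method queries the compressed summary rather than raw data), batch_query_value reduces to a binary search per query:

```python
from bisect import bisect_right
from typing import List

def batch_query_value(sorted_samples: List[float],
                      values: List[float]) -> List[int]:
    # For each query value, return its rank: the count of samples
    # that are less than or equal to the value.
    return [bisect_right(sorted_samples, v) for v in values]
```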

Module contents#