secretflow.ml.boost.homo_boost.tree_core package#

Submodules#

secretflow.ml.boost.homo_boost.tree_core.criterion module#

Classes:

Criterion()

Base class for split criterion

XgboostCriterion([reg_lambda, reg_alpha, ...])

XgboostCriterion split criterion class.

class secretflow.ml.boost.homo_boost.tree_core.criterion.Criterion[source]#

Bases: ABC

Base class for split criterion

Methods:

split_gain(left_node_sum, right_node_sum)

abstract split_gain(left_node_sum, right_node_sum)[source]#
class secretflow.ml.boost.homo_boost.tree_core.criterion.XgboostCriterion(reg_lambda: float = 0.1, reg_alpha: float = 0, decimal: int = 10)[source]#

Bases: Criterion

XgboostCriterion split criterion class.

reg_lambda#

L2 regularization term on weight

reg_alpha#

L1 regularization term on weight

decimal#

truncation precision (number of decimal places)

Methods:

__init__([reg_lambda, reg_alpha, decimal])

split_gain(node_sum, left_node_sum, ...)

Calculate split gain.

truncate(f[, decimal])

Truncate f to control precision.

node_gain(sum_grad, sum_hess)

Calculate node gain.

node_weight(sum_grad, sum_hess)

Calculate node weight.

__init__(reg_lambda: float = 0.1, reg_alpha: float = 0, decimal: int = 10)[source]#
split_gain(node_sum: Tuple[float, float], left_node_sum: Tuple[float, float], right_node_sum: Tuple[float, float]) float[source]#

Calculate split gain.

Parameters
  • node_sum – sum of Grad and Hess at the node being split

  • left_node_sum – sum of Grad and Hess at the left child after the split

  • right_node_sum – sum of Grad and Hess at the right child after the split

Returns

Split gain of this split

Return type

gain

static truncate(f, decimal=10)[source]#

Truncate f to the given number of decimal places. Controlling precision this way can reduce training time when early stopping is used.

node_gain(sum_grad: float, sum_hess: float) float[source]#

Calculate node gain.

Parameters
  • sum_grad – sum of gradient

  • sum_hess – sum of hessian

Returns

Gain of this node

node_weight(sum_grad: float, sum_hess: float) float[source]#

Calculate node weight.

Parameters
  • sum_grad – sum of gradient

  • sum_hess – sum of hessian

Returns

Weight of this node
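
Example. For orientation, these methods presumably follow the standard XGBoost formulas: node weight w = -G / (H + λ), node gain G² / (H + λ), and split gain = gain(left) + gain(right) − gain(parent). A minimal usage sketch with toy grad/hess sums; exact values also depend on reg_alpha and the decimal truncation:

from secretflow.ml.boost.homo_boost.tree_core.criterion import XgboostCriterion

crit = XgboostCriterion(reg_lambda=0.1, reg_alpha=0, decimal=10)

parent = (10.0, 20.0)  # (sum_grad, sum_hess) at the node being split
left = (6.0, 12.0)
right = (4.0, 8.0)

w = crit.node_weight(*parent)   # standard formula: -G / (H + lambda)
g = crit.node_gain(*parent)     # standard formula: G^2 / (H + lambda)
gain = crit.split_gain(parent, left, right)  # gain(L) + gain(R) - gain(parent)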

secretflow.ml.boost.homo_boost.tree_core.decision_tree module#

Classes:

DecisionTree([tree_param, data, ...])

Class for the local version of the decision tree

class secretflow.ml.boost.homo_boost.tree_core.decision_tree.DecisionTree(tree_param: Optional[TreeParam] = None, data: Optional[DataFrame] = None, bin_split_points: Optional[ndarray] = None, tree_id: Optional[int] = None, group_id: Optional[int] = None, iter_round: Optional[int] = None, grad_key: str = 'grad', hess_key: str = 'hess', label_key: str = 'label')[source]#

Bases: object

Class for the local version of the decision tree

tree_param#

parameters for tree building

data#

training data, HDataFrame

bin_split_points#

global binning information

tree_id#

tree id

group_id#

group id indicates which class the tree classifies

iter_round#

iteration round

hess_key#

unique column name for hess value

grad_key#

unique column name for grad value

Methods:

__init__([tree_param, data, ...])

feature_col_sample(all_features[, sample_rate])

Column sampling for features.

get_feature_importance()

convert_bin_to_real()

Convert bucket ids (bid) to real values.

columns_filter()

get_grad_hess_sum(data_frame)

Calculate the sum of grad and hess.

update_feature_importance(split_info)

Update feature importance (default: split count).

fit()

Entry point for local decision tree building.

update_tree(cur_to_split, split_info, ...)

Tree update function.

init_xgboost_model(model_path)

Init a standard XGBoost model.

save_xgboost_model(model_path, tree_nodes)

Transform tree info to a standard XGBoost model.

__init__(tree_param: Optional[TreeParam] = None, data: Optional[DataFrame] = None, bin_split_points: Optional[ndarray] = None, tree_id: Optional[int] = None, group_id: Optional[int] = None, iter_round: Optional[int] = None, grad_key: str = 'grad', hess_key: str = 'hess', label_key: str = 'label')[source]#
feature_col_sample(all_features: List[str], sample_rate: float = 1.0)[source]#

Column sampling for features.

Parameters
  • all_features – a list of feature names for all columns

  • sample_rate – subsample rate, a float in [0, 1]

Returns

A dict of valid features, which will be used in this round of tree building

Return type

valid_features

get_feature_importance()[source]#
convert_bin_to_real()[source]#

Convert bucket ids (bid) to real values.

columns_filter()[source]#
get_grad_hess_sum(data_frame)[source]#

Calculate the sum of grad and hess.

Parameters
  • data_frame – data frame containing grad and hess columns

Returns

grad – sum of grad; hess – sum of hess

Return type

(grad, hess)

update_feature_importance(split_info)[source]#

Update feature importance (default: split count).

Parameters
  • split_info – global optimal split information calculated from the histogram

fit()[source]#

Entry point for local decision tree building.

update_tree(cur_to_split: List[Node], split_info: List[SplitInfo], cur_data_frames: List[DataFrame])[source]#

Tree update function.

Parameters
  • cur_to_split – list of nodes to be split

  • split_info – global optimal split info

  • cur_data_frames – list of dataframes, one per node

Returns

next_layer_node – list of nodes to be evaluated in the next iteration; next_layer_data – list of data to be evaluated in the next iteration

Return type

(next_layer_node, next_layer_data)

init_xgboost_model(model_path: str)[source]#

Init a standard XGBoost model.

Parameters
  • model_path – model path

save_xgboost_model(model_path: str, tree_nodes: List[Node])[source]#

Transform tree info to a standard XGBoost model. Ref: https://xgboost.readthedocs.io/en/latest/dev/structxgboost_1_1TreeParam.html#aab8ff286e59f1bbab47bfa865da4a107

Parameters
  • model_path – model path

  • tree_nodes – federated decision tree internal model

Returns

Updates the standard XGBoost model at model_path
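
Example. A sketch of the fit/save flow based only on the signatures above; TreeParam construction is elided (its fields are not documented here), and the histogram/split exchange a federated run performs is not shown:

import numpy as np
import pandas as pd
from secretflow.ml.boost.homo_boost.tree_core.decision_tree import DecisionTree

# Toy node data: one feature plus the grad/hess/label columns named below.
df = pd.DataFrame({
    "x0": [0.1, 0.4, 0.8, 0.9],
    "grad": [0.5, -0.2, 0.3, -0.1],
    "hess": [0.25, 0.16, 0.21, 0.09],
    "label": [1, 0, 1, 0],
})
bin_split_points = np.array([[0.3, 0.6, 1.0]])  # global binning info

tree = DecisionTree(
    tree_param=None,  # a real run supplies a TreeParam; defaults shown for illustration only
    data=df,
    bin_split_points=bin_split_points,
    tree_id=0,
    group_id=0,
    iter_round=0,
    grad_key="grad",
    hess_key="hess",
    label_key="label",
)
tree.fit()  # entry point for local tree building
# Persisting to a standard XGBoost model takes the trained node list:
# tree.save_xgboost_model("model.json", tree_nodes)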

secretflow.ml.boost.homo_boost.tree_core.feature_histogram module#

Classes:

HistogramBag([histogram, hid, p_hid])

Histogram container

FeatureHistogram()

Feature Histogram

class secretflow.ml.boost.homo_boost.tree_core.feature_histogram.HistogramBag(histogram: Optional[List] = None, hid: int = -1, p_hid: int = -1)[source]#

Bases: object

Histogram container

histogram#

Histogram list calculated by calculate_histogram

Type

List

hid#

histogram id

Type

int

p_hid#

parent histogram id

Type

int

Attributes:

histogram

hid

p_hid

Methods:

binary_op(other, func[, inplace])

__init__([histogram, hid, p_hid])

histogram: List = None#
hid: int = -1#
p_hid: int = -1#
binary_op(other, func: callable, inplace: bool = False)[source]#
__init__(histogram: Optional[List] = None, hid: int = -1, p_hid: int = -1) None#
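
Example. binary_op applies func across two bags' histograms. A minimal sketch using the documented [cols, [buckets, [sum_g, sum_h, count]]] layout; the sibling-equals-parent-minus-child subtraction shown is the classic histogram trick, assumed (not documented) to be binary_op's intended use:

import operator
from secretflow.ml.boost.homo_boost.tree_core.feature_histogram import HistogramBag

# One feature column with two buckets of [sum_g, sum_h, count].
parent = HistogramBag(histogram=[[[1.0, 2.0, 3], [0.5, 1.0, 2]]], hid=0, p_hid=-1)
left = HistogramBag(histogram=[[[0.4, 0.8, 1], [0.1, 0.2, 1]]], hid=1, p_hid=0)

# Derive the right child's histogram as parent - left, elementwise.
right = parent.binary_op(left, operator.sub, inplace=False)
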
class secretflow.ml.boost.homo_boost.tree_core.feature_histogram.FeatureHistogram[source]#

Bases: object

Feature Histogram

Methods:

calculate_histogram(data_frame_list, ...[, ...])

Calculate histogram according to G and H.

calculate_single_histogram(data, bin_split_point)

static calculate_histogram(data_frame_list: List[DataFrame], bin_split_points: ndarray, valid_features: Optional[Dict] = None, use_missing: bool = False, grad_key: str = 'grad', hess_key: str = 'hess', thread_pool: Optional[ThreadPoolExecutor] = None)[source]#

Calculate histogram according to G and H. Histogram layout: [cols, [buckets, [sum_g, sum_h, count]]]

Parameters
  • data_frame_list – a list of data frames, each containing grad and hess

  • bin_split_points – global split points per feature

  • valid_features – valid feature names Dict[id:bool]

  • use_missing – whether missing values participate in training

  • grad_key – unique column name for grad value

  • hess_key – unique column name for hess value

Returns

A list [histogram1, histogram2, …]

Return type

node_histograms

static calculate_single_histogram(data: ndarray, bin_split_point: ndarray)[source]#
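
Example. A usage sketch based only on the signature above; whether the data must already be binned, and the exact shape expected for bin_split_points, are assumptions here:

import numpy as np
import pandas as pd
from secretflow.ml.boost.homo_boost.tree_core.feature_histogram import FeatureHistogram

# One tree node's data: a single feature plus grad/hess columns.
df = pd.DataFrame({
    "x0": [0.1, 0.4, 0.8],
    "grad": [0.5, -0.2, 0.3],
    "hess": [0.25, 0.16, 0.21],
})
bin_split_points = np.array([[0.3, 0.6, 1.0]])  # global split points per feature

node_histograms = FeatureHistogram.calculate_histogram(
    data_frame_list=[df],       # one data frame per tree node
    bin_split_points=bin_split_points,
    valid_features={0: True},   # Dict[id: bool]
    use_missing=False,
    grad_key="grad",
    hess_key="hess",
)
# Each node's histogram follows [cols, [buckets, [sum_g, sum_h, count]]].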

secretflow.ml.boost.homo_boost.tree_core.feature_importance module#

Classes:

FeatureImportance([main_importance, ...])

Feature importance class

class secretflow.ml.boost.homo_boost.tree_core.feature_importance.FeatureImportance(main_importance: float = 0, other_importance: float = 0, main_type: str = 'split')[source]#

Bases: object

Feature importance class

main_importance#

main importance value; its kind is given by main_type

other_importance#

other importance value; the kind opposite to main_type

main_type#

type of importance, e.g. 'gain'

Methods:

__init__([main_importance, ...])

add_gain(val)

add_split(val)

__init__(main_importance: float = 0, other_importance: float = 0, main_type: str = 'split')[source]#
add_gain(val: float)[source]#
add_split(val: float)[source]#
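
Example. A small sketch of the intended accounting, assuming add_split/add_gain route values into main or other importance according to main_type:

from secretflow.ml.boost.homo_boost.tree_core.feature_importance import FeatureImportance

fi = FeatureImportance(main_type="split")
fi.add_split(1)    # this feature was chosen for one more split
fi.add_gain(0.37)  # gain contributed by that split
# With main_type="split", split counts accumulate in main_importance
# and gains in other_importance (and vice versa for main_type="gain").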

secretflow.ml.boost.homo_boost.tree_core.loss_function module#

Classes:

LossFunction(obj_name)

Internal definition of loss functions

class secretflow.ml.boost.homo_boost.tree_core.loss_function.LossFunction(obj_name: str)[source]#

Bases: object

Internal definition of loss functions

obj_name#

Name of the loss function, one of:
  • "binary:logistic" – logistic regression for binary classification, output probability

  • "reg:logistic" – logistic regression

  • "multi:softmax" – multiclass classification using the softmax objective, output class index

  • "multi:softprob" – multiclass classification, output a probability per class

  • "reg:squarederror" – regression with squared loss

Methods:

__init__(obj_name)

obj_function()

__init__(obj_name: str)[source]#
obj_function()[source]#
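
Example. For intuition on what these objectives compute: with prediction p = sigmoid(score) and label y, the "binary:logistic" objective yields grad = p − y and hess = p·(1 − p). A hand-rolled illustration of those standard formulas (not the library implementation):

import math

def binary_logistic_grad_hess(score: float, label: float):
    # Standard first/second-order statistics for binary:logistic.
    p = 1.0 / (1.0 + math.exp(-score))  # sigmoid
    return p - label, p * (1.0 - p)

grad, hess = binary_logistic_grad_hess(score=0.0, label=1.0)
# grad == -0.5, hess == 0.25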

secretflow.ml.boost.homo_boost.tree_core.node module#

Classes:

Node([id, fid, bid, weight, is_leaf, ...])

Tree Node

class secretflow.ml.boost.homo_boost.tree_core.node.Node(id: Optional[int] = None, fid: Optional[int] = None, bid: Optional[int] = None, weight: float = 0.0, is_leaf: bool = False, sum_grad: Optional[float] = None, sum_hess: Optional[float] = None, left_nodeid: int = -1, right_nodeid: int = -1, missing_dir: int = 1, sample_num: int = 0, parent_nodeid: Optional[int] = None, is_left_node: bool = False, sibling_nodeid: Optional[int] = None, loss_change: float = 0.0)[source]#

Bases: object

Tree Node

id#

node id

Type

int

fid#

feature id

Type

int

bid#

bucket id

Type

int

weight#

node weight

Type

float

is_leaf#

whether this node is leaf

Type

bool

sum_grad#

sum of grad

Type

float

sum_hess#

sum of hess

Type

float

left_nodeid#

left node id

Type

int

right_nodeid#

right node id

Type

int

missing_dir#

which branch to take when encountering a missing value; default 1 -> right

Type

int

sample_num#

number of data samples

Type

int

parent_nodeid#

parent nodeid

Type

int

is_left_node#

whether this node is the left child of its parent

Type

bool

sibling_nodeid#

sibling node id

Type

int

loss_change#

the loss change.

Type

float

Attributes:

id

fid

bid

weight

is_leaf

sum_grad

sum_hess

left_nodeid

right_nodeid

missing_dir

sample_num

parent_nodeid

is_left_node

sibling_nodeid

loss_change

Methods:

__init__([id, fid, bid, weight, is_leaf, ...])

id: int = None#
fid: int = None#
bid: int = None#
weight: float = 0.0#
is_leaf: bool = False#
sum_grad: float = None#
sum_hess: float = None#
left_nodeid: int = -1#
right_nodeid: int = -1#
missing_dir: int = 1#
sample_num: int = 0#
parent_nodeid: int = None#
is_left_node: bool = False#
sibling_nodeid: int = None#
loss_change: float = 0.0#
__init__(id: Optional[int] = None, fid: Optional[int] = None, bid: Optional[int] = None, weight: float = 0.0, is_leaf: bool = False, sum_grad: Optional[float] = None, sum_hess: Optional[float] = None, left_nodeid: int = -1, right_nodeid: int = -1, missing_dir: int = 1, sample_num: int = 0, parent_nodeid: Optional[int] = None, is_left_node: bool = False, sibling_nodeid: Optional[int] = None, loss_change: float = 0.0) None#
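
Example. Node is a plain dataclass; a sketch of hand-assembling a one-split stump (ids and values are arbitrary):

from secretflow.ml.boost.homo_boost.tree_core.node import Node

root = Node(id=0, fid=3, bid=7, sum_grad=10.0, sum_hess=20.0,
            left_nodeid=1, right_nodeid=2, missing_dir=1, sample_num=100)
left = Node(id=1, weight=-0.4, is_leaf=True, parent_nodeid=0,
            is_left_node=True, sibling_nodeid=2, sample_num=60)
right = Node(id=2, weight=0.3, is_leaf=True, parent_nodeid=0,
             is_left_node=False, sibling_nodeid=1, sample_num=40)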

secretflow.ml.boost.homo_boost.tree_core.splitter module#

Classes:

SplitInfo([best_fid, best_bid, sum_grad, ...])

Split Info.

Splitter(criterion_method[, ...])

Split calculation class.

class secretflow.ml.boost.homo_boost.tree_core.splitter.SplitInfo(best_fid: Optional[int] = None, best_bid: Optional[int] = None, sum_grad: float = 0, sum_hess: float = 0, gain: Optional[float] = None, missing_dir: int = 1, sample_count: int = -1)[source]#

Bases: object

Split Info.

best_fid#

best split on feature id

Type

int

best_bid#

best split on bucket id

Type

int

sum_grad#

sum of grad

Type

float

sum_hess#

sum of hess

Type

float

gain#

split gain

Type

float

missing_dir#

which branch to take when encountering a missing value; default 1 -> right

Type

int

sample_count#

number of samples after the split

Type

int

Attributes:

best_fid

best_bid

sum_grad

sum_hess

gain

missing_dir

sample_count

Methods:

__init__([best_fid, best_bid, sum_grad, ...])

best_fid: int = None#
best_bid: int = None#
sum_grad: float = 0#
sum_hess: float = 0#
gain: float = None#
missing_dir: int = 1#
sample_count: int = -1#
__init__(best_fid: Optional[int] = None, best_bid: Optional[int] = None, sum_grad: float = 0, sum_hess: float = 0, gain: Optional[float] = None, missing_dir: int = 1, sample_count: int = -1) None#
class secretflow.ml.boost.homo_boost.tree_core.splitter.Splitter(criterion_method: str, criterion_params: List = [0, 0, 10], min_impurity_split: float = 0.01, min_sample_split: int = 2, min_leaf_node: int = 1, min_child_weight: int = 1)[source]#

Bases: object

Split calculation class.

criterion_method#

criterion method

criterion_params#

criterion params, e.g. [l1: 0.1, l2: 0.2]

min_impurity_split#

minimum gain threshold of splitting

min_sample_split#

minimum number of samples required to split, defaults to 2

min_leaf_node#

minimum number of samples required on a node to split

min_child_weight#

minimum sum of hess after split

Methods:

__init__(criterion_method[, ...])

find_split_once(histogram, valid_features, ...)

Find best split info from histogram

find_split(histograms, valid_features[, ...])

Find the optimal split points.

node_gain(grad, hess)

node_weight(grad, hess)

split_gain(sum_grad, sum_hess, sum_grad_l, ...)

__init__(criterion_method: str, criterion_params: List = [0, 0, 10], min_impurity_split: float = 0.01, min_sample_split: int = 2, min_leaf_node: int = 1, min_child_weight: int = 1)[source]#
find_split_once(histogram: List, valid_features: Dict, use_missing: bool) SplitInfo[source]#

Find best split info from histogram

Parameters
  • histogram – a three-dimensional matrix storing G, H, Count

  • valid_features – valid feature names Dict[id:bool]

  • use_missing – whether missing values participate in training

Returns

best split point info

Return type

SplitInfo

find_split(histograms: List, valid_features: Dict, use_missing: bool = False) List[SplitInfo][source]#

Find the optimal split points.

Parameters
  • histograms – a list of histograms

  • valid_features – valid feature names Dict[id:bool]

  • use_missing – whether missing values participate in training

Returns

best split info on each node

Return type

tree_node_splitinfo

node_gain(grad: float, hess: float) float[source]#
node_weight(grad: float, hess: float) float[source]#
split_gain(sum_grad: float, sum_hess: float, sum_grad_l: float, sum_hess_l: float, sum_grad_r: float, sum_hess_r: float) float[source]#
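
Example. A usage sketch based only on the signatures above. The string accepted for criterion_method is not documented here, so "xgboost" is an assumption, and criterion_params is presumed to follow [reg_lambda, reg_alpha, decimal] (matching XgboostCriterion's parameters and the default [0, 0, 10]):

from secretflow.ml.boost.homo_boost.tree_core.splitter import Splitter

splitter = Splitter(
    criterion_method="xgboost",     # assumed value; the accepted strings are not documented here
    criterion_params=[0.1, 0, 10],  # presumed [reg_lambda, reg_alpha, decimal]
    min_impurity_split=0.01,
    min_sample_split=2,
    min_leaf_node=1,
    min_child_weight=1,
)

# Toy input: one node, one feature column, two buckets of [sum_g, sum_h, count],
# e.g. as produced by FeatureHistogram.calculate_histogram.
histograms = [[[[1.0, 2.0, 3], [0.5, 1.0, 2]]]]
split_infos = splitter.find_split(histograms, valid_features={0: True}, use_missing=False)
for info in split_infos:
    print(info.best_fid, info.best_bid, info.gain)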

Module contents#