secretflow.stats package#

Subpackages#

secretflow.stats.core package

Submodules#

secretflow.stats.biclassification_eval module#

Classes:

BiClassificationEval(y_true, y_score, ...)

Statistics Evaluation for a bi-classification model on a dataset.

class secretflow.stats.biclassification_eval.BiClassificationEval(y_true: Union[FedNdarray, VDataFrame], y_score: Union[FedNdarray, VDataFrame], bucket_size: int)[source]#

Bases: object

Statistics Evaluation for a bi-classification model on a dataset.

Attributes:

y_true: Union[FedNdarray, VDataFrame]: the true labels
y_score: Union[FedNdarray, VDataFrame]: the predicted scores
bucket_size: int: the number of bins used in the reports

Methods:

`__init__`(y_true, y_score, bucket_size)
`get_all_reports`()	Get all reports.

__init__(y_true: Union[FedNdarray, VDataFrame], y_score: Union[FedNdarray, VDataFrame], bucket_size: int)[source]#

get_all_reports() → PYUObject[source]#

Get all reports. The reports contain:

summary_report: SummaryReport

group_reports: List[GroupReport]

eq_frequent_bin_report: List[EqBinReport]

eq_range_bin_report: List[EqBinReport]

head_report: List[PrReport]: reports for fpr = 0.001, 0.005, 0.01, 0.05, 0.1, 0.2

See core.biclassification_eval_core for more details.

secretflow.stats.psi_eval module#

Functions:

psi_eval(X, Y, split_points)

Calculate population stability index.

secretflow.stats.psi_eval.psi_eval(X: Union[FedNdarray, VDataFrame], Y: Union[FedNdarray, VDataFrame], split_points) → PYUObject[source]#

Calculate population stability index.

Parameters

X – Union[FedNdarray, VDataFrame] a collection of samples
Y – Union[FedNdarray, VDataFrame] a collection of samples
split_points – array an ordered sequence of split points

Returns

float: population stability index

Return type

result
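As a plaintext illustration of what a population stability index measures, the sketch below bins two sample collections with the same ordered split points and sums (p - q) * ln(p / q) over the bins. It does not use SecretFlow. The helper names, the treatment of split points as interior boundaries, and the eps floor for empty bins are assumptions made for this example, not details taken from the library.

```python
import math
from bisect import bisect_right


def bin_fractions(samples, split_points):
    """Fraction of samples falling in each bin delimited by ordered split points."""
    counts = [0] * (len(split_points) + 1)
    for value in samples:
        # bisect_right sends a value equal to a split point into the upper bin.
        counts[bisect_right(split_points, value)] += 1
    return [count / len(samples) for count in counts]


def psi_plain(x, y, split_points, eps=1e-6):
    """Plaintext PSI: sum over bins of (p - q) * ln(p / q)."""
    p = bin_fractions(x, split_points)
    q = bin_fractions(y, split_points)
    total = 0.0
    for p_i, q_i in zip(p, q):
        # eps (an assumption of this sketch) keeps empty bins from producing log(0).
        p_i, q_i = max(p_i, eps), max(q_i, eps)
        total += (p_i - q_i) * math.log(p_i / q_i)
    return total


expected = [1.0, 1.0, 1.0, 4.0]
observed = [1.0, 4.0, 4.0, 4.0]
print(psi_plain(expected, observed, [2.5]))
```

Identical distributions give a PSI of 0, and the value grows as the bin fractions of X and Y diverge.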

secretflow.stats.pva_eval module#

Functions:

pva_eval(actual, prediction, target)

Compute Prediction Vs Actual score.

secretflow.stats.pva_eval.pva_eval(actual: Union[FedNdarray, VDataFrame], prediction: Union[FedNdarray, VDataFrame], target) → PYUObject[source]#

Compute Prediction Vs Actual score.

Parameters

actual – Union[FedNdarray, VDataFrame]
prediction – Union[FedNdarray, VDataFrame]
target – numeric, the target label in actual entries to consider.

Returns

result: PYUObject wrapping a float equal to abs(mean(prediction) - sum(actual == target)/count(actual))
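The formula above can be checked on plain Python lists. This is only a plaintext sketch of the arithmetic. The function name pva_plain is hypothetical and the code does not call SecretFlow.

```python
def pva_plain(actual, prediction, target):
    """abs(mean(prediction) - sum(actual == target) / count(actual)) on plain lists."""
    mean_prediction = sum(prediction) / len(prediction)
    target_rate = sum(1 for label in actual if label == target) / len(actual)
    return abs(mean_prediction - target_rate)


actual = [1, 0, 1, 1]
prediction = [0.9, 0.2, 0.7, 0.6]
print(pva_plain(actual, prediction, target=1))
```

Here the mean prediction is 0.6 and the share of entries equal to the target is 0.75, so the score is the gap between the two.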

secretflow.stats.regression_eval module#

Classes:

RegressionEval(y_true, y_pred[, spu_device, ...])

Statistics Evaluation for a regression model on a dataset.

class secretflow.stats.regression_eval.RegressionEval(y_true: FedNdarray, y_pred: FedNdarray, spu_device: Optional[SPU] = None, bins=10)[source]#

Bases: object

Statistics Evaluation for a regression model on a dataset.

y_true#: FedNdarray If y_true comes from a single party, each statistic is a PYUObject. If y_true comes from multiple parties, an SPU device is required and each statistic is an SPUObject.

y_pred#: FedNdarray y_true and y_pred must have the same devices and partition shapes.

r2_score#: Union[PYUObject, SPUObject]

mean_abs_err#: Union[PYUObject, SPUObject]

mean_abs_percent_err#: Union[PYUObject, SPUObject]

sum_squared_errors#: Union[PYUObject, SPUObject]

mean_squared_errors#: Union[PYUObject, SPUObject]

root_mean_squared_errors#: Union[PYUObject, SPUObject]

y_true_mean#: Union[PYUObject, SPUObject]

y_pred_mean#: Union[PYUObject, SPUObject]

residual_hist#: Union[PYUObject, SPUObject]

Methods:

`__init__`(y_true, y_pred[, spu_device, bins])
`gen_all_reports`()

__init__(y_true: FedNdarray, y_pred: FedNdarray, spu_device: Optional[SPU] = None, bins=10)[source]#

gen_all_reports()[source]#
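For orientation, the statistics named by the attributes above have standard plaintext definitions, sketched below on plain Python lists. The function regression_metrics is hypothetical, the residual is taken as y_true - y_pred by assumption, and residual_hist is left out because its binning is not described on this page. The library computes these values on PYU or SPU devices rather than in the clear.

```python
import math


def regression_metrics(y_true, y_pred):
    """Textbook regression statistics computed in plaintext."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    y_true_mean = sum(y_true) / n
    sse = sum(r * r for r in residuals)                 # sum of squared errors
    sst = sum((t - y_true_mean) ** 2 for t in y_true)   # total sum of squares
    mse = sse / n
    return {
        "r2_score": 1 - sse / sst,
        "mean_abs_err": sum(abs(r) for r in residuals) / n,
        # Undefined when y_true contains zeros; this sketch does not guard against that.
        "mean_abs_percent_err": sum(abs(r / t) for r, t in zip(residuals, y_true)) / n,
        "sum_squared_errors": sse,
        "mean_squared_errors": mse,
        "root_mean_squared_errors": math.sqrt(mse),
        "y_true_mean": y_true_mean,
        "y_pred_mean": sum(y_pred) / n,
    }


metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0])
for name, value in metrics.items():
    print(name, value)
```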

secretflow.stats.score_card module#

Classes:

ScoreCard(odd_base, scaled_value, pdo[, ...])

The component provides a mapping procedure from binary regression's probability value to an integer range score.

class secretflow.stats.score_card.ScoreCard(odd_base: float, scaled_value: float, pdo: float, max_score: int = 1000, min_score: int = 0, bad_label_value: int = 0)[source]#

Bases: object

The component provides a mapping procedure from binary regression’s probability value to an integer range score.

The mapping process is as follows:

    odds = pred / (1 - pred)
    score = offset + factor * log(odds)

The offset and factor in the formula come from the user’s settings. Usually users do not give offset and factor directly, but instead give three constraint parameters:

    scaled_value: a score baseline
    odd_base: the odds value at the given score baseline
    pdo: how many points are needed to double the odds

The offset and factor can be solved from these three constraint parameters:

    factor = pdo / log(2)
    offset = scaled_value - (factor * log(odd_base))

odd_base / scaled_value / pdo: see above

max_score#: upper limit for the score

min_score#: lower limit for the score

bad_label_value#: which label represents the negative sample
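The mapping formulas above can be traced with a small plaintext sketch. The function score_plain is hypothetical and works on a single float. Clamping to [min_score, max_score] is an assumption based on the two limit attributes, and neither integer rounding nor the effect of bad_label_value is modeled here.

```python
import math


def score_plain(pred, odd_base, scaled_value, pdo, max_score=1000, min_score=0):
    """Map one predicted probability to a score using the formulas above."""
    factor = pdo / math.log(2)
    offset = scaled_value - factor * math.log(odd_base)
    odds = pred / (1 - pred)
    score = offset + factor * math.log(odds)
    # Clamping is an assumption of this sketch.
    return min(max(score, min_score), max_score)


# With odd_base=1 and scaled_value=600, odds of 1 (pred=0.5) map to the baseline score.
print(score_plain(0.5, odd_base=1.0, scaled_value=600.0, pdo=20.0))
# Doubling the odds (pred=2/3 gives odds of 2) adds pdo points to the score.
print(score_plain(2 / 3, odd_base=1.0, scaled_value=600.0, pdo=20.0))
```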

Methods:

`__init__`(odd_base, scaled_value, pdo[, ...])
`transform`(pred)	Map predicted probabilities to scores.

__init__(odd_base: float, scaled_value: float, pdo: float, max_score: int = 1000, min_score: int = 0, bad_label_value: int = 0)[source]#

transform(pred: Union[FedNdarray, VDataFrame, HDataFrame]) → FedNdarray[source]#

Map predicted probabilities to scores.

Parameters: pred – Union[FedNdarray, VDataFrame, HDataFrame] predicted probability from binary regression
Returns: mapped scores.

secretflow.stats.ss_pearsonr_v module#

Classes:

PearsonR(device)

Calculate pearson product-moment correlation coefficient for vertical slice dataset by using secret sharing.

class secretflow.stats.ss_pearsonr_v.PearsonR(device: SPU)[source]#

Bases: object

Calculate pearson product-moment correlation coefficient for vertical slice dataset by using secret sharing.

For more details on the Pearson correlation coefficient, see https://en.wikipedia.org/wiki/Pearson_correlation_coefficient

For large datasets (more than 100,000 samples and 200 features), the [Ring size: 128, Fxp: 40] options are recommended for the SPU device.

device#: SPU Device

Methods:

__init__(device)

pearsonr(vdata[, standardize])

__init__(device: SPU)[source]#

pearsonr(vdata: VDataFrame, standardize: bool = True)[source]#

vdata#: VDataFrame vertical slice dataset.

standardize#: bool Whether to standardize the dataset. The dataset must be standardized, so keep standardize=True unless the dataset has already been standardized. Standardization serves two purposes:
- it reduces the magnitude of the entries of the xtx matrix, which avoids overflow in secret sharing;
- after standardization the variance is 1 and the mean is 0, which simplifies the calculation.
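To see why standardization simplifies the calculation, note that once every column has mean 0 and variance 1, the correlation matrix is just the scaled product of the standardized data with itself. The plaintext sketch below shows this on plain Python lists. The helper names and the use of the population standard deviation (dividing by n) are assumptions of this example, and no secret sharing is involved.

```python
import math


def standardize(column):
    """Shift a column to mean 0 and scale it to (population) variance 1."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
    return [(v - mean) / std for v in column]


def pearson_matrix(columns):
    """Correlation matrix as Z^T Z / n, where Z holds the standardized columns."""
    z = [standardize(column) for column in columns]
    n = len(columns[0])
    return [[sum(a * b for a, b in zip(z_i, z_j)) / n for z_j in z] for z_i in z]


features = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with the first column
    [4.0, 3.0, 2.0, 1.0],   # perfectly anti-correlated with the first column
]
for row in pearson_matrix(features):
    print([round(value, 6) for value in row])
```

Because every standardized entry is of order 1, the sums in the product stay small relative to the raw data, which is the overflow concern mentioned above.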

secretflow.stats.ss_pvalue_v module#

Classes:

PVlaue(spu)

Calculate P-Value for LR model training on vertical slice dataset by using secret sharing.

class secretflow.stats.ss_pvalue_v.PVlaue(spu: SPU)[source]#

Bases: object

Calculate P-Value for LR model training on vertical slice dataset by using secret sharing.

For more details on P-Value, see https://www.w3schools.com/datascience/ds_linear_regression_pvalue.asp

For large datasets (more than 100,000 samples and 200 features), the [Ring size: 128, Fxp: 40] options are recommended for the SPU device.

device#: SPU Device

Methods:

`__init__`(spu)
`pvalues`(x, y, model)	Compute p-values for an LR model.

__init__(spu: SPU) → None[source]#

pvalues(x: VDataFrame, y: VDataFrame, model: LinearModel) → ndarray[source]#

Compute p-values for an LR model.

Parameters

x – VDataFrame input dataset
y – VDataFrame true label
model – LinearModel LR model

Returns

The p-values.

secretflow.stats.ss_vif_v module#

Classes:

VIF(device)

Calculate variance inflation factor for vertical slice dataset by using secret sharing.

class secretflow.stats.ss_vif_v.VIF(device: SPU)[source]#

Bases: object

Calculate variance inflation factor for vertical slice dataset by using secret sharing.

See https://en.wikipedia.org/wiki/Variance_inflation_factor for details.

For large datasets (more than 100,000 samples and 200 features), the [Ring size: 128, Fxp: 40] options are recommended for the SPU device.

NOTICE: The analytical solution of matrix inversion is very expensive in secret sharing, so this method uses Newton iteration to find an approximate solution.

When there is multicollinearity in the input dataset, the XTX matrix is not full rank, and the analytical solution for the inverse of the XTX matrix does not exist.

For such linearly correlated columns, statsmodels reports a VIF of INF, indicating that the correlation is infinite. This method instead produces a large VIF value (>> 1000) on these columns, which still correctly reflects their strong correlation.

When the data contains constant columns, statsmodels reports a VIF of NAN, while this method again produces a large VIF value (>> 1000), which means these columns need to be removed before training.

Therefore, although the results of this method cannot be completely consistent with those of statsmodels, which computes in plaintext, they still correctly reflect the correlation of the input data columns.
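As a plaintext illustration of approximating a matrix inverse by Newton iteration, the sketch below uses the Newton-Schulz update X <- X(2I - AX) and reads the VIF values off the diagonal of the inverted correlation matrix. The helper names, the initial guess A^T / (||A||_1 * ||A||_inf), and the iteration count are assumptions of this example. The page does not state which variant or starting point the library uses.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def newton_inverse(a, iterations=30):
    """Approximate the inverse of a square matrix with Newton-Schulz iteration."""
    n = len(a)
    norm_1 = max(sum(abs(a[i][j]) for i in range(n)) for j in range(n))
    norm_inf = max(sum(abs(v) for v in row) for row in a)
    scale = 1.0 / (norm_1 * norm_inf)
    # Initial guess: scaled transpose, a common choice that makes the iteration converge.
    x = [[a[j][i] * scale for j in range(n)] for i in range(n)]
    for _ in range(iterations):
        ax = matmul(a, x)
        correction = [
            [(2.0 if i == j else 0.0) - ax[i][j] for j in range(n)] for i in range(n)
        ]
        x = matmul(x, correction)
    return x


def vif_from_correlation(corr, iterations=30):
    """VIF of each column: the diagonal of the inverse correlation matrix."""
    inverse = newton_inverse(corr, iterations)
    return [inverse[i][i] for i in range(len(corr))]


# Two columns with correlation 0.5: the exact VIF is 1 / (1 - 0.5**2).
print(vif_from_correlation([[1.0, 0.5], [0.5, 1.0]]))
```

The iteration only needs matrix multiplications and subtractions, which is why this kind of scheme suits secret sharing better than an analytical inverse. On a singular matrix it cannot converge to a true inverse, which is consistent with the large but finite values described above.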

device#: SPU Device

Methods:

__init__(device)

vif(vdata[, standardize])

__init__(device: SPU)[source]#

vif(vdata: VDataFrame, standardize: bool = True)[source]#

vdata#: VDataFrame vertical slice dataset.

standardize#: bool Whether to standardize the dataset. The dataset must be standardized, so keep standardize=True unless the dataset has already been standardized. Standardization serves two purposes:
- it reduces the magnitude of the entries of the xtx matrix, which avoids overflow in secret sharing;
- after standardization the variance is 1 and the mean is 0, which simplifies the calculation.

secretflow.stats.table_statistics module#

Functions:

table_statistics(table)

Get table statistics for a pd.DataFrame or VDataFrame.

secretflow.stats.table_statistics.table_statistics(table: Union[DataFrame, VDataFrame]) → DataFrame[source]#

Get table statistics for a pd.DataFrame or VDataFrame.

Parameters

table – Union[pd.DataFrame, VDataFrame]

Returns

pd.DataFrame

including each column’s datatype, total_count, count, count_na, min, max, var, std, sem, skewness, kurtosis, q1, q2, q3, moment_2, moment_3, moment_4, central_moment_2, central_moment_3, central_moment_4, sum, sum_2, sum_3 and sum_4.

moment_2 means E[X^2].

central_moment_2 means E[(X - mean(X))^2].

sum_2 means sum(X^2).

Return type

table_statistics
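The three definitions given above (moment_2, central_moment_2 and sum_2) can be compared on a small plaintext column. The function moments_plain is a hypothetical helper written for this illustration and is not part of the library.

```python
def moments_plain(values):
    """sum_2, moment_2 and central_moment_2 as defined above, for one column."""
    n = len(values)
    mean = sum(values) / n
    sum_2 = sum(v ** 2 for v in values)                          # sum(X^2)
    return {
        "sum_2": sum_2,
        "moment_2": sum_2 / n,                                   # E[X^2]
        "central_moment_2": sum((v - mean) ** 2 for v in values) / n,  # E[(X - mean(X))^2]
    }


stats = moments_plain([1.0, 2.0, 3.0, 4.0])
print(stats)
# The two moments are linked by central_moment_2 = moment_2 - mean(X)^2.
print(stats["moment_2"] - 2.5 ** 2)
```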

Module contents#

Classes:

`SSVertPearsonR`	Alias of `PearsonR`
`SSVertVIF`	Alias of `VIF`
`SSPValue`	Alias of `PVlaue`
`RegressionEval`(y_true, y_pred[, spu_device, ...])	Statistics Evaluation for a regression model on a dataset.
`BiClassificationEval`(y_true, y_score, ...)	Statistics Evaluation for a bi-classification model on a dataset.
`ScoreCard`(odd_base, scaled_value, pdo[, ...])	The component provides a mapping procedure from binary regression's probability value to an integer range score.

Functions:

`pva_eval`(actual, prediction, target)	Compute Prediction Vs Actual score.
`table_statistics`(table)	Get table statistics for a pd.DataFrame or VDataFrame.
`psi_eval`(X, Y, split_points)	Calculate population stability index.

secretflow.stats.SSVertPearsonR#

Alias of PearsonR.

Methods:

__init__(device)

pearsonr(vdata[, standardize])

secretflow.stats.SSVertVIF#

Alias of VIF.

Methods:

__init__(device)

vif(vdata[, standardize])

secretflow.stats.SSPValue#

Alias of PVlaue.

Methods:

`__init__`(spu)
`pvalues`(x, y, model)	Compute p-values for an LR model.

class secretflow.stats.RegressionEval(y_true: FedNdarray, y_pred: FedNdarray, spu_device: Optional[SPU] = None, bins=10)[source]#

Bases: object

Statistics Evaluation for a regression model on a dataset.

y_true#: FedNdarray If y_true comes from a single party, each statistic is a PYUObject. If y_true comes from multiple parties, an SPU device is required and each statistic is an SPUObject.

y_pred#: FedNdarray y_true and y_pred must have the same devices and partition shapes.

r2_score#: Union[PYUObject, SPUObject]

mean_abs_err#: Union[PYUObject, SPUObject]

mean_abs_percent_err#: Union[PYUObject, SPUObject]

sum_squared_errors#: Union[PYUObject, SPUObject]

mean_squared_errors#: Union[PYUObject, SPUObject]

root_mean_squared_errors#: Union[PYUObject, SPUObject]

y_true_mean#: Union[PYUObject, SPUObject]

y_pred_mean#: Union[PYUObject, SPUObject]

residual_hist#: Union[PYUObject, SPUObject]

Methods:

`__init__`(y_true, y_pred[, spu_device, bins])
`gen_all_reports`()

__init__(y_true: FedNdarray, y_pred: FedNdarray, spu_device: Optional[SPU] = None, bins=10)[source]#

gen_all_reports()[source]#

class secretflow.stats.BiClassificationEval(y_true: Union[FedNdarray, VDataFrame], y_score: Union[FedNdarray, VDataFrame], bucket_size: int)[source]#

Bases: object

Statistics Evaluation for a bi-classification model on a dataset.

Attributes:

y_true: Union[FedNdarray, VDataFrame]: the true labels
y_score: Union[FedNdarray, VDataFrame]: the predicted scores
bucket_size: int: the number of bins used in the reports

Methods:

`__init__`(y_true, y_score, bucket_size)
`get_all_reports`()	Get all reports.

__init__(y_true: Union[FedNdarray, VDataFrame], y_score: Union[FedNdarray, VDataFrame], bucket_size: int)[source]#

get_all_reports() → PYUObject[source]#

Get all reports. The reports contain:

summary_report: SummaryReport

group_reports: List[GroupReport]

eq_frequent_bin_report: List[EqBinReport]

eq_range_bin_report: List[EqBinReport]

head_report: List[PrReport]: reports for fpr = 0.001, 0.005, 0.01, 0.05, 0.1, 0.2

See core.biclassification_eval_core for more details.

secretflow.stats.pva_eval(actual: Union[FedNdarray, VDataFrame], prediction: Union[FedNdarray, VDataFrame], target) → PYUObject[source]#

Compute Prediction Vs Actual score.

Parameters

actual – Union[FedNdarray, VDataFrame]
prediction – Union[FedNdarray, VDataFrame]
target – numeric, the target label in actual entries to consider.

Returns

result: PYUObject wrapping a float equal to abs(mean(prediction) - sum(actual == target)/count(actual))

secretflow.stats.table_statistics(table: Union[DataFrame, VDataFrame]) → DataFrame[source]#

Get table statistics for a pd.DataFrame or VDataFrame.

Parameters

table – Union[pd.DataFrame, VDataFrame]

Returns

pd.DataFrame

moment_2 means E[X^2].

central_moment_2 means E[(X - mean(X))^2].

sum_2 means sum(X^2).

Return type

table_statistics

secretflow.stats.psi_eval(X: Union[FedNdarray, VDataFrame], Y: Union[FedNdarray, VDataFrame], split_points) → PYUObject[source]#

Calculate population stability index.

Parameters

X – Union[FedNdarray, VDataFrame] a collection of samples
Y – Union[FedNdarray, VDataFrame] a collection of samples
split_points – array an ordered sequence of split points

Returns

float: population stability index

Return type

result

class secretflow.stats.ScoreCard(odd_base: float, scaled_value: float, pdo: float, max_score: int = 1000, min_score: int = 0, bad_label_value: int = 0)[source]#

Bases: object

The component provides a mapping procedure from binary regression’s probability value to an integer range score.

The mapping process is as follows:

    odds = pred / (1 - pred)
    score = offset + factor * log(odds)

The offset and factor in the formula come from the user’s settings. Usually users do not give offset and factor directly, but instead give three constraint parameters:

    scaled_value: a score baseline
    odd_base: the odds value at the given score baseline
    pdo: how many points are needed to double the odds

The offset and factor can be solved from these three constraint parameters:

    factor = pdo / log(2)
    offset = scaled_value - (factor * log(odd_base))

odd_base / scaled_value / pdo: see above

max_score#: upper limit for the score

min_score#: lower limit for the score

bad_label_value#: which label represents the negative sample

Methods:

`__init__`(odd_base, scaled_value, pdo[, ...])
`transform`(pred)	Map predicted probabilities to scores.

__init__(odd_base: float, scaled_value: float, pdo: float, max_score: int = 1000, min_score: int = 0, bad_label_value: int = 0)[source]#

transform(pred: Union[FedNdarray, VDataFrame, HDataFrame]) → FedNdarray[source]#

Map predicted probabilities to scores.

Parameters: pred – Union[FedNdarray, VDataFrame, HDataFrame] predicted probability from binary regression
Returns: mapped scores.