secretflow.stats package#

Subpackages#

secretflow.stats.core package

Submodules#

secretflow.stats.biclassification_eval module#

Classes:

BiClassificationEval(y_true, y_score, ...)

Statistics Evaluation for a bi-classification model on a dataset.

class secretflow.stats.biclassification_eval.BiClassificationEval(y_true: Union[FedNdarray, VDataFrame], y_score: Union[FedNdarray, VDataFrame], bucket_size: int)[source]#

Bases: object

Statistics Evaluation for a bi-classification model on a dataset.

Attributes:

y_true: Union[FedNdarray, VDataFrame]: input of labels
y_score: Union[FedNdarray, VDataFrame]: input of prediction scores
bucket_size: int: input of number of bins in report

Methods:

`__init__`(y_true, y_score, bucket_size)
`get_all_reports`()	get all reports.

__init__(y_true: Union[FedNdarray, VDataFrame], y_score: Union[FedNdarray, VDataFrame], bucket_size: int)[source]#

get_all_reports() → PYUObject[source]#

Get all reports. The reports contain:

summary_report: SummaryReport

group_reports: List[GroupReport]

eq_frequent_bin_report: List[EqBinReport]

eq_range_bin_report: List[EqBinReport]

head_report: List[PrReport]: reports for fpr = 0.001, 0.005, 0.01, 0.05, 0.1, 0.2

see more in core.biclassification_eval_core

secretflow.stats.psi_eval module#

Functions:

psi_eval(X, Y, split_points)

Calculate population stability index.

secretflow.stats.psi_eval.psi_eval(X: Union[FedNdarray, VDataFrame], Y: Union[FedNdarray, VDataFrame], split_points) → PYUObject[source]#

Calculate population stability index.

Parameters

X – Union[FedNdarray, VDataFrame] a collection of samples
Y – Union[FedNdarray, VDataFrame] a collection of samples
split_points – array an ordered sequence of split points

Returns

result – the population stability index, a float held in the returned PYUObject

Return type

PYUObject
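The index itself can be illustrated in plain Python, outside of any SecretFlow device. This is a minimal sketch, not the library's implementation: it assumes the common definition PSI = sum((p_i - q_i) * ln(p_i / q_i)) over bins, treats split_points as interior bin boundaries, and uses a hypothetical `eps` guard for empty bins. The helper names `bin_fractions` and `psi` are illustrative only.

```python
import math
from bisect import bisect_right


def bin_fractions(samples, split_points):
    """Return the fraction of samples that falls into each bin.

    split_points must be sorted; they are treated as interior boundaries,
    so len(split_points) + 1 bins are produced (an assumption).
    """
    counts = [0] * (len(split_points) + 1)
    for value in samples:
        counts[bisect_right(split_points, value)] += 1
    total = len(samples)
    return [count / total for count in counts]


def psi(x, y, split_points, eps=1e-6):
    """Population stability index between two sample collections.

    eps is a hypothetical floor that keeps log() finite for empty bins.
    """
    p = bin_fractions(x, split_points)
    q = bin_fractions(y, split_points)
    total = 0.0
    for p_i, q_i in zip(p, q):
        p_i = max(p_i, eps)
        q_i = max(q_i, eps)
        total += (p_i - q_i) * math.log(p_i / q_i)
    return total


baseline = [1, 2, 3, 4]
shifted = [1, 1, 2, 4]
splits = [1.5, 2.5, 3.5]
print(psi(baseline, baseline, splits))  # identical distributions give 0.0
print(psi(baseline, shifted, splits))   # a shifted distribution gives a positive value
```

Under this definition the index is symmetric in X and Y and is zero only when both collections have the same bin fractions.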

secretflow.stats.pva_eval module#

Functions:

pva_eval(actual, prediction, target)

Compute Prediction Vs Actual score.

secretflow.stats.pva_eval.pva_eval(actual: Union[FedNdarray, VDataFrame], prediction: Union[FedNdarray, VDataFrame], target) → PYUObject[source]#

Compute Prediction Vs Actual score.

Parameters

actual – Union[FedNdarray, VDataFrame]
prediction – Union[FedNdarray, VDataFrame]
target – numeric the target label in actual entries to consider.

Returns

result – PYUObject holding a float equal to abs(mean(prediction) - sum(actual == target) / count(actual))
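The formula above can be checked on plain Python lists. This is only a sketch of the arithmetic; the helper name `pva` is hypothetical and the real function operates on federated data and returns a PYUObject.

```python
def pva(actual, prediction, target):
    """abs(mean(prediction) - sum(actual == target) / count(actual))."""
    mean_prediction = sum(prediction) / len(prediction)
    target_rate = sum(1 for label in actual if label == target) / len(actual)
    return abs(mean_prediction - target_rate)


# Three of the four labels equal the target (rate 0.75),
# while the mean predicted score is 0.6.
score = pva(actual=[1, 0, 1, 1], prediction=[0.9, 0.2, 0.7, 0.6], target=1)
print(score)
```

A value near zero indicates that the average predicted score matches the observed rate of the target label.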

secretflow.stats.regression_eval module#

Classes:

RegressionEval(y_true, y_pred[, spu_device, ...])

Statistics Evaluation for a regression model on a dataset.

class secretflow.stats.regression_eval.RegressionEval(y_true: FedNdarray, y_pred: FedNdarray, spu_device: Optional[SPU] = None, bins=10)[source]#

Bases: object

Statistics Evaluation for a regression model on a dataset.

y_true#: FedNdarray If y_true is from a single party, then each statistic is a PYUObject. If y_true is from multiple parties, then an SPU device is required and each statistic is an SPUObject.

y_pred#: FedNdarray y_true and y_pred must have the same device and partition shapes

r2_score#: Union[PYUObject, SPUObject]

mean_abs_err#: Union[PYUObject, SPUObject]

mean_abs_percent_err#: Union[PYUObject, SPUObject]

sum_squared_errors#: Union[PYUObject, SPUObject]

mean_squared_errors#: Union[PYUObject, SPUObject]

root_mean_squared_errors#: Union[PYUObject, SPUObject]

y_true_mean#: Union[PYUObject, SPUObject]

y_pred_mean#: Union[PYUObject, SPUObject]

residual_hist#: Union[PYUObject, SPUObject]
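The attribute names above suggest the usual regression metrics. The following plain-Python sketch shows the textbook definitions those names normally refer to; it is an assumption that the library uses exactly these formulas, and `regression_metrics` is a hypothetical helper. residual_hist is omitted because its binning convention is not described here.

```python
import math


def regression_metrics(y_true, y_pred):
    """Textbook regression metrics on plain lists (illustrative only)."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    y_true_mean = sum(y_true) / n
    y_pred_mean = sum(y_pred) / n
    sse = sum(r * r for r in residuals)                  # sum of squared errors
    sst = sum((t - y_true_mean) ** 2 for t in y_true)    # total sum of squares
    mse = sse / n
    return {
        "r2_score": 1 - sse / sst,
        "mean_abs_err": sum(abs(r) for r in residuals) / n,
        # Assumes no zero entries in y_true.
        "mean_abs_percent_err": sum(abs(r / t) for r, t in zip(residuals, y_true)) / n,
        "sum_squared_errors": sse,
        "mean_squared_errors": mse,
        "root_mean_squared_errors": math.sqrt(mse),
        "y_true_mean": y_true_mean,
        "y_pred_mean": y_pred_mean,
    }


report = regression_metrics([1, 2, 3, 4], [1, 2, 3, 5])
for name, value in report.items():
    print(name, value)
```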

Methods:

`__init__`(y_true, y_pred[, spu_device, bins])
`gen_all_reports`()

__init__(y_true: FedNdarray, y_pred: FedNdarray, spu_device: Optional[SPU] = None, bins=10)[source]#

gen_all_reports()[source]#

secretflow.stats.score_card module#

Classes:

ScoreCard(odd_base, scaled_value, pdo[, ...])

The component provides a mapping procedure from binary regression's probability value to an integer range score.

class secretflow.stats.score_card.ScoreCard(odd_base: float, scaled_value: float, pdo: float, max_score: int = 1000, min_score: int = 0, bad_label_value: int = 0)[source]#

Bases: object

The component provides a mapping procedure from binary regression’s probability value to an integer range score.

The mapping process is as follows:
    odds = pred / (1 - pred)
    score = offset + factor * log(odds)
The offset and factor in the formula come from the user’s settings. Usually users do not directly give offset and factor, but give three constraint parameters:
    scaled_value: a score baseline
    odd_base: the odds value at given score baseline
    pdo: how many scores are needed to double odds
The offset and factor can be solved using these three constraint parameters:
    factor = pdo / log(2)
    offset = scaled_value - (factor * log(odd_base))

odd_base / scaled_value / pdo: see above

max_score#: upper limit for the score

min_score#: lower limit for the score

bad_label_value#: which label represents the negative sample
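The mapping formulas can be traced with a few lines of plain Python. This is a sketch under stated assumptions: the helper names are hypothetical, clamping to [min_score, max_score] is assumed from the attribute descriptions, bad_label_value is ignored, and no rounding to an integer is applied because the rounding rule is not given here.

```python
import math


def score_card_params(odd_base, scaled_value, pdo):
    """Solve offset and factor from the three constraint parameters."""
    factor = pdo / math.log(2)
    offset = scaled_value - factor * math.log(odd_base)
    return offset, factor


def to_score(pred, odd_base, scaled_value, pdo, max_score=1000, min_score=0):
    """Map a probability in (0, 1) to a score."""
    offset, factor = score_card_params(odd_base, scaled_value, pdo)
    odds = pred / (1 - pred)
    score = offset + factor * math.log(odds)
    # Assumption: scores outside the limits are clamped.
    return min(max(score, min_score), max_score)


# Odds of 20 land on the baseline score; doubling the odds adds pdo points.
print(to_score(20 / 21, odd_base=20, scaled_value=600, pdo=20))
print(to_score(40 / 41, odd_base=20, scaled_value=600, pdo=20))
```

With odd_base=20, scaled_value=600 and pdo=20, a prediction whose odds equal 20 maps to 600 and one whose odds equal 40 maps to 620, which matches the meaning of the three constraints.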

Methods:

`__init__`(odd_base, scaled_value, pdo[, ...])
`transform`(pred)	Map predicted probabilities to scores.

__init__(odd_base: float, scaled_value: float, pdo: float, max_score: int = 1000, min_score: int = 0, bad_label_value: int = 0)[source]#

transform(pred: Union[FedNdarray, VDataFrame, HDataFrame]) → FedNdarray[source]#

Map predicted probabilities to scores.

Parameters: pred – Union[FedNdarray, VDataFrame, HDataFrame] predicted probability from binary regression
Returns: mapped scores.

secretflow.stats.ss_pearsonr_v module#

Classes:

PearsonR(device)

Calculate the Pearson product-moment correlation coefficient for a vertically sliced dataset using secret sharing.

class secretflow.stats.ss_pearsonr_v.PearsonR(device: SPU)[source]#

Bases: object

Calculate the Pearson product-moment correlation coefficient for a vertically sliced dataset using secret sharing.

More detail on the Pearson correlation coefficient: https://en.wikipedia.org/wiki/Pearson_correlation_coefficient

For large datasets (more than 100,000 samples and 200 features), the [Ring size: 128, Fxp: 40] options are recommended for the SPU device.

device#: SPU Device

Methods:

__init__(device)

pearsonr(vdata[, standardize])


__init__(device: SPU)[source]#

pearsonr(vdata: VDataFrame, standardize: bool = True)[source]#

vdata#: VDataFrame vertical slice dataset.

standardize#: bool Whether to standardize the dataset. The dataset must be standardized, so keep standardize=True unless the dataset is already standardized. Standardization serves two purposes:
- it reduces the magnitude of the entries of the matrix XTX, which avoids overflow in secret sharing;
- after standardization, the variance is 1 and the mean is 0, which simplifies the calculation.
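Why standardization simplifies the calculation can be seen in plaintext: once every column has mean 0 and variance 1, the correlation matrix is just XTX divided by the number of samples. The sketch below uses plain Python lists and hypothetical helper names; it assumes the population variance (division by n) and says nothing about how the secret-shared computation is actually carried out.

```python
import math


def standardize(column):
    """Shift a column to mean 0 and scale it to (population) variance 1."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
    return [(v - mean) / std for v in column]


def pearson_matrix(columns):
    """Correlation matrix as (Z^T Z) / n, where Z holds standardized columns."""
    z = [standardize(column) for column in columns]
    n = len(columns[0])
    return [
        [sum(a * b for a, b in zip(z_i, z_j)) / n for z_j in z]
        for z_i in z
    ]


features = [
    [1, 2, 3, 4],   # base column
    [2, 4, 6, 8],   # perfectly positively correlated with the base column
    [4, 3, 2, 1],   # perfectly negatively correlated with the base column
]
for row in pearson_matrix(features):
    print([round(value, 6) for value in row])
```

Because standardized entries are bounded in magnitude by the data's spread rather than its raw scale, the products accumulated in XTX stay smaller, which is the overflow concern named above.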

secretflow.stats.ss_pvalue_v module#

Classes:

PVlaue(spu)

Calculate P-Value for LR model training on vertical slice dataset by using secret sharing.

class secretflow.stats.ss_pvalue_v.PVlaue(spu: SPU)[source]#

Bases: object

Calculate P-Value for LR model training on vertical slice dataset by using secret sharing.

More detail on p-values: https://www.w3schools.com/datascience/ds_linear_regression_pvalue.asp

For large datasets (more than 100,000 samples and 200 features), the [Ring size: 128, Fxp: 40] options are recommended for the SPU device.

device#: SPU Device

Methods:

`__init__`(spu)
`pvalues`(x, y, model)	Compute p-values for an LR model.

__init__(spu: SPU) → None[source]#

pvalues(x: VDataFrame, y: VDataFrame, model: LinearModel) → ndarray[source]#

Compute p-values for an LR model.

Parameters

x – VDataFrame input dataset
y – VDataFrame true label
model – LinearModel lr model

Returns

PValue

secretflow.stats.ss_vif_v module#

Classes:

VIF(device)

Calculate variance inflation factor for vertical slice dataset by using secret sharing.

class secretflow.stats.ss_vif_v.VIF(device: SPU)[source]#

Bases: object

Calculate variance inflation factor for vertical slice dataset by using secret sharing.

see https://en.wikipedia.org/wiki/Variance_inflation_factor

For large datasets (more than 100,000 samples and 200 features), the [Ring size: 128, Fxp: 40] options are recommended for the SPU device.

NOTICE: The analytical solution of matrix inversion in secret sharing is very expensive, so this method uses Newton iteration to find approximate solution.

When there is multicollinearity in the input dataset, the XTX matrix is not full rank, and the analytical solution for the inverse of the XTX matrix does not exist.

The VIF results that statsmodels calculates for these linearly correlated columns are INF, indicating that the correlation is infinite. This method instead yields a large VIF value (>> 1000) on these columns, which also correctly reflects their strong correlation.

When there are constant columns in the data, the VIF result calculated by statsmodels is NAN, and the result of this method is also a large VIF value (>> 1000), which means these columns need to be removed before training.

Therefore, although the results of this method cannot be completely consistent with those of statsmodels, which calculates in plaintext, they still correctly reflect the correlation of the input data columns.
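To make the idea of an iterative inverse concrete, here is a plaintext sketch. It assumes the Newton-Schulz scheme X_{k+1} = X_k (2I - A X_k) and the textbook identity that the VIF of column i is the i-th diagonal entry of the inverse correlation matrix; which Newton variant, starting point and iteration count the library uses is not stated here, so all of those (and the helper names) are assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def newton_inverse(a, iterations=30):
    """Approximate the inverse of a square matrix by Newton-Schulz iteration."""
    n = len(a)
    norm_1 = max(sum(abs(a[i][j]) for i in range(n)) for j in range(n))
    norm_inf = max(sum(abs(v) for v in row) for row in a)
    scale = norm_1 * norm_inf
    # Starting point A^T / (||A||_1 * ||A||_inf), a standard convergent choice.
    x = [[a[j][i] / scale for j in range(n)] for i in range(n)]
    for _ in range(iterations):
        ax = matmul(a, x)
        correction = [
            [(2.0 if i == j else 0.0) - ax[i][j] for j in range(n)]
            for i in range(n)
        ]
        x = matmul(x, correction)
    return x


def vif_from_correlation(corr, iterations=30):
    """VIF per column: diagonal of the (approximate) inverse correlation matrix."""
    inverse = newton_inverse(corr, iterations)
    return [inverse[i][i] for i in range(len(corr))]


# Two features with correlation 0.5: exact VIF is 1 / (1 - 0.5**2) = 4/3.
print(vif_from_correlation([[1.0, 0.5], [0.5, 1.0]]))
```

An iteration like this only uses additions and multiplications, which is one reason such schemes are attractive where an analytical inverse is expensive. When the matrix is singular the iteration has no true inverse to converge to, which is consistent with the large but finite VIF values described above.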

device#: SPU Device

Methods:

__init__(device)

vif(vdata[, standardize])


__init__(device: SPU)[source]#

vif(vdata: VDataFrame, standardize: bool = True)[source]#

vdata#: VDataFrame vertical slice dataset.

standardize#: bool Whether to standardize the dataset. The dataset must be standardized, so keep standardize=True unless the dataset is already standardized. Standardization serves two purposes:
- it reduces the magnitude of the entries of the matrix XTX, which avoids overflow in secret sharing;
- after standardization, the variance is 1 and the mean is 0, which simplifies the calculation.

secretflow.stats.table_statistics module#

Functions:

table_statistics(table)

Get table statistics for a pd.DataFrame or VDataFrame.

secretflow.stats.table_statistics.table_statistics(table: Union[DataFrame, VDataFrame]) → DataFrame[source]#

Get table statistics for a pd.DataFrame or VDataFrame.

Parameters

table – Union[pd.DataFrame, VDataFrame]

Returns

table_statistics – a pd.DataFrame including each column’s datatype, total_count, count, count_na, min, max, var, std, sem, skewness, kurtosis, q1, q2, q3, moment_2, moment_3, moment_4, central_moment_2, central_moment_3, central_moment_4, sum, sum_2, sum_3 and sum_4.

moment_2 means E[X^2].

central_moment_2 means E[(X - mean(X))^2].

sum_2 means sum(X^2).

Return type

pd.DataFrame
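The three families of statistics defined above differ only in centering and normalization, which a small plain-Python sketch makes visible. The helper name `power_statistics` is hypothetical, and the sketch assumes a column without missing values; how the library treats NA entries in these statistics is not described here.

```python
def power_statistics(values, k):
    """Return moment_k, central_moment_k and sum_k for one column."""
    n = len(values)
    mean = sum(values) / n
    return {
        f"moment_{k}": sum(v ** k for v in values) / n,                    # E[X^k]
        f"central_moment_{k}": sum((v - mean) ** k for v in values) / n,   # E[(X - mean(X))^k]
        f"sum_{k}": sum(v ** k for v in values),                           # sum(X^k)
    }


column = [1, 2, 3]
print(power_statistics(column, 2))
print(power_statistics(column, 3))
```

For the column [1, 2, 3], sum_2 is 14, moment_2 is 14/3, and central_moment_2 is 2/3, which is the population variance.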

Module contents#

Classes:

`SSVertPearsonR`	alias of `PearsonR`
`SSVertVIF`	alias of `VIF`
`SSPValue`	alias of `PVlaue`
`RegressionEval`(y_true, y_pred[, spu_device, ...])	Statistics Evaluation for a regression model on a dataset.
`BiClassificationEval`(y_true, y_score, ...)	Statistics Evaluation for a bi-classification model on a dataset.
`ScoreCard`(odd_base, scaled_value, pdo[, ...])	The component provides a mapping procedure from binary regression's probability value to an integer range score.

Functions:

`pva_eval`(actual, prediction, target)	Compute Prediction Vs Actual score.
`table_statistics`(table)	Get table statistics for a pd.DataFrame or VDataFrame.
`psi_eval`(X, Y, split_points)	Calculate population stability index.

secretflow.stats.SSVertPearsonR#

alias of PearsonR

Methods:

__init__(device)

pearsonr(vdata[, standardize])


secretflow.stats.SSVertVIF#

alias of VIF

Methods:

__init__(device)

vif(vdata[, standardize])


secretflow.stats.SSPValue#

alias of PVlaue

Methods:

`__init__`(spu)
`pvalues`(x, y, model)	Compute p-values for an LR model.

class secretflow.stats.RegressionEval(y_true: FedNdarray, y_pred: FedNdarray, spu_device: Optional[SPU] = None, bins=10)[source]#

Bases: object

Statistics Evaluation for a regression model on a dataset.

y_true#: FedNdarray If y_true is from a single party, then each statistic is a PYUObject. If y_true is from multiple parties, then an SPU device is required and each statistic is an SPUObject.

y_pred#: FedNdarray y_true and y_pred must have the same device and partition shapes

r2_score#: Union[PYUObject, SPUObject]

mean_abs_err#: Union[PYUObject, SPUObject]

mean_abs_percent_err#: Union[PYUObject, SPUObject]

sum_squared_errors#: Union[PYUObject, SPUObject]

mean_squared_errors#: Union[PYUObject, SPUObject]

root_mean_squared_errors#: Union[PYUObject, SPUObject]

y_true_mean#: Union[PYUObject, SPUObject]

y_pred_mean#: Union[PYUObject, SPUObject]

residual_hist#: Union[PYUObject, SPUObject]

Methods:

`__init__`(y_true, y_pred[, spu_device, bins])
`gen_all_reports`()

__init__(y_true: FedNdarray, y_pred: FedNdarray, spu_device: Optional[SPU] = None, bins=10)[source]#

gen_all_reports()[source]#

class secretflow.stats.BiClassificationEval(y_true: Union[FedNdarray, VDataFrame], y_score: Union[FedNdarray, VDataFrame], bucket_size: int)[source]#

Bases: object

Statistics Evaluation for a bi-classification model on a dataset.

Attributes:

y_true: Union[FedNdarray, VDataFrame]: input of labels
y_score: Union[FedNdarray, VDataFrame]: input of prediction scores
bucket_size: int: input of number of bins in report

Methods:

`__init__`(y_true, y_score, bucket_size)
`get_all_reports`()	get all reports.

__init__(y_true: Union[FedNdarray, VDataFrame], y_score: Union[FedNdarray, VDataFrame], bucket_size: int)[source]#

get_all_reports() → PYUObject[source]#

Get all reports. The reports contain:

summary_report: SummaryReport

group_reports: List[GroupReport]

eq_frequent_bin_report: List[EqBinReport]

eq_range_bin_report: List[EqBinReport]

head_report: List[PrReport]: reports for fpr = 0.001, 0.005, 0.01, 0.05, 0.1, 0.2

see more in core.biclassification_eval_core

secretflow.stats.pva_eval(actual: Union[FedNdarray, VDataFrame], prediction: Union[FedNdarray, VDataFrame], target) → PYUObject[source]#

Compute Prediction Vs Actual score.

Parameters

actual – Union[FedNdarray, VDataFrame]
prediction – Union[FedNdarray, VDataFrame]
target – numeric the target label in actual entries to consider.

Returns

result – PYUObject holding a float equal to abs(mean(prediction) - sum(actual == target) / count(actual))

secretflow.stats.table_statistics(table: Union[DataFrame, VDataFrame]) → DataFrame[source]#

Get table statistics for a pd.DataFrame or VDataFrame.

Parameters

table – Union[pd.DataFrame, VDataFrame]

Returns

table_statistics – a pd.DataFrame of per-column statistics.

moment_2 means E[X^2].

central_moment_2 means E[(X - mean(X))^2].

sum_2 means sum(X^2).

Return type

pd.DataFrame

secretflow.stats.psi_eval(X: Union[FedNdarray, VDataFrame], Y: Union[FedNdarray, VDataFrame], split_points) → PYUObject[source]#

Calculate population stability index.

Parameters

X – Union[FedNdarray, VDataFrame] a collection of samples
Y – Union[FedNdarray, VDataFrame] a collection of samples
split_points – array an ordered sequence of split points

Returns

result – the population stability index, a float held in the returned PYUObject

Return type

PYUObject

class secretflow.stats.ScoreCard(odd_base: float, scaled_value: float, pdo: float, max_score: int = 1000, min_score: int = 0, bad_label_value: int = 0)[source]#

Bases: object

The component provides a mapping procedure from binary regression’s probability value to an integer range score.

The mapping process is as follows:
    odds = pred / (1 - pred)
    score = offset + factor * log(odds)
The offset and factor in the formula come from the user’s settings. Usually users do not directly give offset and factor, but give three constraint parameters:
    scaled_value: a score baseline
    odd_base: the odds value at given score baseline
    pdo: how many scores are needed to double odds
The offset and factor can be solved using these three constraint parameters:
    factor = pdo / log(2)
    offset = scaled_value - (factor * log(odd_base))

odd_base / scaled_value / pdo: see above

max_score#: upper limit for the score

min_score#: lower limit for the score

bad_label_value#: which label represents the negative sample

Methods:

`__init__`(odd_base, scaled_value, pdo[, ...])
`transform`(pred)	Map predicted probabilities to scores.

__init__(odd_base: float, scaled_value: float, pdo: float, max_score: int = 1000, min_score: int = 0, bad_label_value: int = 0)[source]#

transform(pred: Union[FedNdarray, VDataFrame, HDataFrame]) → FedNdarray[source]#

Map predicted probabilities to scores.

Parameters: pred – Union[FedNdarray, VDataFrame, HDataFrame] predicted probability from binary regression
Returns: mapped scores.