secretflow.stats package#
Subpackages#
Submodules#
secretflow.stats.biclassification_eval module#
Classes:
| BiClassificationEval(y_true, y_score, bucket_size) | Statistics Evaluation for a bi-classification model on a dataset. |
- class secretflow.stats.biclassification_eval.BiClassificationEval(y_true: Union[FedNdarray, VDataFrame], y_score: Union[FedNdarray, VDataFrame], bucket_size: int)[source]#
Bases: object
Statistics Evaluation for a bi-classification model on a dataset.
- Attributes:
- y_true: Union[FedNdarray, VDataFrame]
the input labels
- y_score: Union[FedNdarray, VDataFrame]
the input prediction scores
- bucket_size: int
the number of bins used in the reports
Methods:
| __init__(y_true, y_score, bucket_size) | |
| get_all_reports() | Get all reports. |
- __init__(y_true: Union[FedNdarray, VDataFrame], y_score: Union[FedNdarray, VDataFrame], bucket_size: int)[source]#
- get_all_reports() → PYUObject [source]#
Get all reports. The reports contain:
- summary_report: SummaryReport
- group_reports: List[GroupReport]
- eq_frequent_bin_report: List[EqBinReport]
- eq_range_bin_report: List[EqBinReport]
- head_report: List[PrReport], the reports for fpr = 0.001, 0.005, 0.01, 0.05, 0.1, 0.2
See core.biclassification_eval_core for more details.
secretflow.stats.psi_eval module#
Functions:
| psi_eval(X, Y, split_points) | Calculate population stability index. |
- secretflow.stats.psi_eval.psi_eval(X: Union[FedNdarray, VDataFrame], Y: Union[FedNdarray, VDataFrame], split_points) → PYUObject [source]#
Calculate population stability index.
- Parameters
X – Union[FedNdarray, VDataFrame], a collection of samples
Y – Union[FedNdarray, VDataFrame], a collection of samples
split_points – array, an ordered sequence of split points
- Returns
result – float, the population stability index
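To make the quantity concrete, here is a minimal plaintext sketch of a population stability index over two samples, written in NumPy rather than with secretflow devices. It assumes the common textbook definition, sum((q - p) * ln(q / p)) over bucket proportions p and q, and it assumes that split_points lists every bin edge including the outer ones. The function name psi_sketch and the eps guard are hypothetical and are not part of secretflow.

```python
import numpy as np


def psi_sketch(x, y, split_points, eps=1e-12):
    """Plaintext PSI between two samples bucketed by ordered split points."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Assumption: split_points are the bin edges, including both outer edges.
    edges = np.asarray(split_points, dtype=float)
    x_counts, _ = np.histogram(x, bins=edges)
    y_counts, _ = np.histogram(y, bins=edges)
    # Convert counts to proportions; eps guards against log(0) for empty bins.
    p = np.clip(x_counts / x_counts.sum(), eps, None)
    q = np.clip(y_counts / y_counts.sum(), eps, None)
    return float(np.sum((q - p) * np.log(q / p)))
```

Two identical samples give a PSI of 0, and the value grows as the bucket proportions of the two samples drift apart.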
secretflow.stats.pva_eval module#
Functions:
| pva_eval(actual, prediction, target) | Compute Prediction Vs Actual score. |
- secretflow.stats.pva_eval.pva_eval(actual: Union[FedNdarray, VDataFrame], prediction: Union[FedNdarray, VDataFrame], target) → PYUObject [source]#
Compute Prediction Vs Actual score.
- Parameters
actual – Union[FedNdarray, VDataFrame]
prediction – Union[FedNdarray, VDataFrame]
target – numeric, the target label in actual entries to consider
- Returns
result – PYUObject whose underlying value is a float equal to abs(mean(prediction) - sum(actual == target) / count(actual))
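The formula above can be checked on plaintext arrays with a short NumPy sketch. It only mirrors the stated expression and does not use FedNdarray or any device placement; the name pva_sketch is hypothetical.

```python
import numpy as np


def pva_sketch(actual, prediction, target):
    """abs(mean(prediction) - sum(actual == target) / count(actual)) in plaintext."""
    actual = np.asarray(actual)
    prediction = np.asarray(prediction, dtype=float)
    # Fraction of actual entries that equal the target label.
    target_rate = np.sum(actual == target) / actual.size
    return float(abs(prediction.mean() - target_rate))
```

For example, with three of four actual labels equal to the target and a constant prediction of 0.5, the score is abs(0.5 - 0.75) = 0.25.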
secretflow.stats.regression_eval module#
Classes:
| RegressionEval(y_true, y_pred[, spu_device, bins]) | Statistics Evaluation for a regression model on a dataset. |
- class secretflow.stats.regression_eval.RegressionEval(y_true: FedNdarray, y_pred: FedNdarray, spu_device: Optional[SPU] = None, bins=10)[source]#
Bases: object
Statistics Evaluation for a regression model on a dataset.
- y_true#
FedNdarray. If y_true is from a single party, each statistic is a PYUObject. If y_true is from multiple parties, an SPU device is required and each statistic is an SPUObject.
- y_pred#
FedNdarray. y_true and y_pred must have the same device and partition shapes.
- r2_score#
Union[PYUObject, SPUObject]
- mean_abs_err#
Union[PYUObject, SPUObject]
- mean_abs_percent_err#
Union[PYUObject, SPUObject]
- sum_squared_errors#
Union[PYUObject, SPUObject]
- mean_squared_errors#
Union[PYUObject, SPUObject]
- root_mean_squared_errors#
Union[PYUObject, SPUObject]
- y_true_mean#
Union[PYUObject, SPUObject]
- y_pred_mean#
Union[PYUObject, SPUObject]
- residual_hist#
Union[PYUObject, SPUObject]
Methods:
| __init__(y_true, y_pred[, spu_device, bins]) | |
- __init__(y_true: FedNdarray, y_pred: FedNdarray, spu_device: Optional[SPU] = None, bins=10)[source]#
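As a plaintext reference for the attribute names above, the following NumPy sketch computes the same set of statistics using their usual textbook definitions. This is an assumption: the page does not spell out the formulas, so the sketch may differ in detail from what RegressionEval computes on PYU or SPU devices. The function name regression_stats_sketch is hypothetical.

```python
import numpy as np


def regression_stats_sketch(y_true, y_pred, bins=10):
    """Textbook regression statistics on plaintext arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual = y_true - y_pred
    sse = float(np.sum(residual ** 2))
    # Total sum of squares around the mean of the labels.
    sst = float(np.sum((y_true - y_true.mean()) ** 2))
    counts, edges = np.histogram(residual, bins=bins)
    return {
        "r2_score": 1.0 - sse / sst,
        "mean_abs_err": float(np.mean(np.abs(residual))),
        # Assumes y_true contains no zeros.
        "mean_abs_percent_err": float(np.mean(np.abs(residual / y_true))),
        "sum_squared_errors": sse,
        "mean_squared_errors": sse / y_true.size,
        "root_mean_squared_errors": float(np.sqrt(sse / y_true.size)),
        "y_true_mean": float(y_true.mean()),
        "y_pred_mean": float(y_pred.mean()),
        "residual_hist": (counts, edges),
    }
```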
secretflow.stats.score_card module#
Classes:
| ScoreCard(odd_base, scaled_value, pdo[, ...]) | The component provides a mapping procedure from binary regression's probability value to an integer range score. |
- class secretflow.stats.score_card.ScoreCard(odd_base: float, scaled_value: float, pdo: float, max_score: int = 1000, min_score: int = 0, bad_label_value: int = 0)[source]#
Bases: object
The component provides a mapping procedure from binary regression’s probability value to an integer range score.
- The mapping process is as follows:
odds = pred / (1 - pred)
score = offset + factor * log(odds)
- The offset and factor in the formula come from the user’s settings. Usually users do not directly give offset and factor, but give three constraint parameters:
scaled_value: a score baseline
odd_base: the odds value at given score baseline
pdo: how many scores are needed to double odds
- The offset and factor can be solved using these three constraint parameters:
factor = pdo / log(2)
offset = scaled_value - (factor * log(odd_base))
- odd_base / scaled_value / pdo
see above
- max_score#
upper limit for the score
- min_score#
lower limit for the score
- bad_label_value#
which label represents the negative sample
Methods:
| __init__(odd_base, scaled_value, pdo[, ...]) | |
| transform(pred) | Map predicted probabilities to scores. |
- __init__(odd_base: float, scaled_value: float, pdo: float, max_score: int = 1000, min_score: int = 0, bad_label_value: int = 0)[source]#
- transform(pred: Union[FedNdarray, VDataFrame, HDataFrame]) → FedNdarray [source]#
Map predicted probabilities to scores.
- Parameters
pred – Union[FedNdarray, VDataFrame, HDataFrame], predicted probability from binary regression
- Returns
mapped scores.
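The mapping described for ScoreCard can be sketched on plaintext probabilities as follows. The factor and offset lines follow the formulas given above; the rounding to integers and the clipping to [min_score, max_score] are assumptions about how the limits are applied, and bad_label_value is deliberately ignored. The name score_card_sketch is hypothetical.

```python
import numpy as np


def score_card_sketch(pred, odd_base, scaled_value, pdo, max_score=1000, min_score=0):
    """Map probabilities to integer scores: score = offset + factor * log(odds)."""
    pred = np.asarray(pred, dtype=float)
    factor = pdo / np.log(2)
    offset = scaled_value - factor * np.log(odd_base)
    odds = pred / (1.0 - pred)
    score = offset + factor * np.log(odds)
    # Assumption: scores are rounded to integers and clipped to the limits.
    return np.clip(np.rint(score), min_score, max_score).astype(int)
```

With odd_base=1, scaled_value=600 and pdo=20, a probability of 0.5 (odds 1) maps to 600, and doubling the odds to 2 raises the score by pdo to 620.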
secretflow.stats.ss_pearsonr_v module#
Classes:
| PearsonR(device) | Calculate pearson product-moment correlation coefficient for vertical slice dataset by using secret sharing. |
- class secretflow.stats.ss_pearsonr_v.PearsonR(device: SPU)[source]#
Bases: object
Calculate pearson product-moment correlation coefficient for vertical slice dataset by using secret sharing.
For more details on PearsonR, see https://en.wikipedia.org/wiki/Pearson_correlation_coefficient
For large datasets (more than 100,000 samples and 200 features), the [Ring size: 128, Fxp: 40] options are recommended for the SPU device.
- device#
SPU Device
Methods:
| __init__(device) | |
| pearsonr(vdata[, standardize]) | |
- pearsonr(vdata: VDataFrame, standardize: bool = True)[source]#
- vdata#
VDataFrame, vertical slice dataset.
- standardize#
bool, whether to standardize the dataset. The dataset must be standardized, so keep standardize=True unless the dataset is already standardized. Standardizing serves two purposes:
- it reduces the magnitude of the entries of the matrix xtx, which avoids overflow in secret sharing.
- after standardization, the variance is 1 and the mean is 0, which simplifies the calculation.
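The role of standardization can be illustrated in plaintext: once every column has mean 0 and variance 1, the correlation matrix reduces to a scaled xtx product. The NumPy sketch below shows that identity only. It is not the secret-sharing implementation, and the name pearsonr_sketch and the use of the sample standard deviation (ddof=1) are assumptions.

```python
import numpy as np


def pearsonr_sketch(data, standardize=True):
    """Correlation matrix of the columns of data via x.T @ x on standardized data."""
    x = np.asarray(data, dtype=float)
    n = x.shape[0]
    if standardize:
        # Shift each column to mean 0 and scale it to unit sample variance.
        x = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
    # For standardized columns, xtx / (n - 1) is the Pearson correlation matrix.
    return (x.T @ x) / (n - 1)
```

Passing standardize=False is only meaningful when the columns are already standardized, which matches the guidance above.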
secretflow.stats.ss_pvalue_v module#
Classes:
| PVlaue(spu) | Calculate P-Value for LR model training on vertical slice dataset by using secret sharing. |
- class secretflow.stats.ss_pvalue_v.PVlaue(spu: SPU)[source]#
Bases: object
Calculate P-Value for LR model training on vertical slice dataset by using secret sharing.
For more details on P-Value, see https://www.w3schools.com/datascience/ds_linear_regression_pvalue.asp
For large datasets (more than 100,000 samples and 200 features), the [Ring size: 128, Fxp: 40] options are recommended for the SPU device.
- device#
SPU Device
Methods:
| __init__(spu) | |
| pvalues(x, y, model) | Compute p-values for an LR model. |
- pvalues(x: VDataFrame, y: VDataFrame, model: LinearModel) → ndarray [source]#
Compute p-values for an LR model.
- Parameters
x – VDataFrame, input dataset
y – VDataFrame, true label
model – LinearModel, LR model
- Returns
PValue
secretflow.stats.ss_vif_v module#
Classes:
| VIF(device) | Calculate variance inflation factor for vertical slice dataset by using secret sharing. |
- class secretflow.stats.ss_vif_v.VIF(device: SPU)[source]#
Bases: object
Calculate variance inflation factor for vertical slice dataset by using secret sharing.
See https://en.wikipedia.org/wiki/Variance_inflation_factor
For large datasets (more than 100,000 samples and 200 features), the [Ring size: 128, Fxp: 40] options are recommended for the SPU device.
NOTICE: Computing the analytical solution of matrix inversion in secret sharing is very expensive, so this method uses Newton iteration to find an approximate solution.
When there is multicollinearity in the input dataset, the XTX matrix is not full rank, and the analytical solution for the inverse of the XTX matrix does not exist.
For such linearly correlated columns, the VIF computed by statsmodels is INF, indicating that the correlation is infinite. This method instead returns a large VIF value (>> 1000) for these columns, which also correctly reflects their strong correlation.
When there are constant columns in the data, the VIF computed by statsmodels is NAN, while this method again returns a large VIF value (>> 1000), which means these columns need to be removed before training.
Therefore, although the results of this method cannot be fully consistent with those that statsmodels computes in plaintext, they still correctly reflect the correlation of the input data columns.
- device#
SPU Device
Methods:
| __init__(device) | |
| vif(vdata[, standardize]) | |
- vif(vdata: VDataFrame, standardize: bool = True)[source]#
- vdata#
VDataFrame, vertical slice dataset.
- standardize#
bool, whether to standardize the dataset. The dataset must be standardized, so keep standardize=True unless the dataset is already standardized. Standardizing serves two purposes:
- it reduces the magnitude of the entries of the matrix xtx, which avoids overflow in secret sharing.
- after standardization, the variance is 1 and the mean is 0, which simplifies the calculation.
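The idea of replacing an analytical inverse with Newton iteration can be sketched in plaintext. The example below uses the classical Newton-Schulz update X <- X (2I - A X) and reads the VIF of each column off the diagonal of the inverse correlation matrix. The initial guess, the iteration count, and the function names are assumptions chosen for illustration; the page does not describe the exact variant that the VIF class runs on the SPU.

```python
import numpy as np


def newton_inverse(a, iterations=40):
    """Approximate the inverse of a square matrix with Newton-Schulz iteration."""
    a = np.asarray(a, dtype=float)
    identity = np.eye(a.shape[0])
    # Classical starting point that converges for any invertible matrix.
    x = a.T / (np.linalg.norm(a, 1) * np.linalg.norm(a, np.inf))
    for _ in range(iterations):
        # Uses only matrix products, which are cheap compared with exact inversion.
        x = x @ (2.0 * identity - a @ x)
    return x


def vif_sketch(data, iterations=40):
    """VIF per column as the diagonal of the inverse correlation matrix."""
    x = np.asarray(data, dtype=float)
    n = x.shape[0]
    x = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
    corr = (x.T @ x) / (n - 1)
    return np.diag(newton_inverse(corr, iterations))
```

On well-conditioned data the result agrees with an exact inverse. On nearly collinear columns the approximate inverse produces large finite values where an exact computation would diverge, which is consistent with the behavior described in the notice above.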
secretflow.stats.table_statistics module#
Functions:
| table_statistics(table) | Get table statistics for a pd.DataFrame or VDataFrame. |
- secretflow.stats.table_statistics.table_statistics(table: Union[DataFrame, VDataFrame]) → DataFrame [source]#
Get table statistics for a pd.DataFrame or VDataFrame.
- Parameters
table – Union[pd.DataFrame, VDataFrame]
- Returns
table_statistics – pd.DataFrame including each column’s datatype, total_count, count, count_na, min, max, var, std, sem, skewness, kurtosis, q1, q2, q3, moment_2, moment_3, moment_4, central_moment_2, central_moment_3, central_moment_4, sum, sum_2, sum_3 and sum_4.
moment_2 means E[X^2].
central_moment_2 means E[(X - mean(X))^2].
sum_2 means sum(X^2).
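The three definitions above generalize to order k in the obvious way. A short NumPy sketch for a single plaintext column, with the hypothetical name moment_stats_sketch, makes the distinction between them explicit:

```python
import numpy as np


def moment_stats_sketch(column, k=2):
    """moment_k = E[X^k], central_moment_k = E[(X - mean(X))^k], sum_k = sum(X^k)."""
    x = np.asarray(column, dtype=float)
    return {
        f"moment_{k}": float(np.mean(x ** k)),
        f"central_moment_{k}": float(np.mean((x - x.mean()) ** k)),
        f"sum_{k}": float(np.sum(x ** k)),
    }
```

For the column [1, 2, 3], moment_2 is 14/3, central_moment_2 is 2/3, and sum_2 is 14.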
Module contents#
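Classes:
| SSVertPearsonR | alias of PearsonR |
| SSVertVIF | alias of VIF |
| SSPValue | alias of PVlaue |
| RegressionEval(y_true, y_pred[, spu_device, bins]) | Statistics Evaluation for a regression model on a dataset. |
| BiClassificationEval(y_true, y_score, bucket_size) | Statistics Evaluation for a bi-classification model on a dataset. |
| ScoreCard(odd_base, scaled_value, pdo[, ...]) | The component provides a mapping procedure from binary regression's probability value to an integer range score. |
Functions:
| pva_eval(actual, prediction, target) | Compute Prediction Vs Actual score. |
| table_statistics(table) | Get table statistics for a pd.DataFrame or VDataFrame. |
| psi_eval(X, Y, split_points) | Calculate population stability index. |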
- secretflow.stats.SSVertPearsonR#
alias of PearsonR
Methods:
| __init__(device) | |
| pearsonr(vdata[, standardize]) | |
- secretflow.stats.SSVertVIF#
alias of VIF
Methods:
| __init__(device) | |
| vif(vdata[, standardize]) | |
- secretflow.stats.SSPValue#
alias of PVlaue
Methods:
| __init__(spu) | |
| pvalues(x, y, model) | Compute p-values for an LR model. |
- class secretflow.stats.RegressionEval(y_true: FedNdarray, y_pred: FedNdarray, spu_device: Optional[SPU] = None, bins=10)[source]#
Bases: object
Statistics Evaluation for a regression model on a dataset.
- y_true#
FedNdarray. If y_true is from a single party, each statistic is a PYUObject. If y_true is from multiple parties, an SPU device is required and each statistic is an SPUObject.
- y_pred#
FedNdarray. y_true and y_pred must have the same device and partition shapes.
- r2_score#
Union[PYUObject, SPUObject]
- mean_abs_err#
Union[PYUObject, SPUObject]
- mean_abs_percent_err#
Union[PYUObject, SPUObject]
- sum_squared_errors#
Union[PYUObject, SPUObject]
- mean_squared_errors#
Union[PYUObject, SPUObject]
- root_mean_squared_errors#
Union[PYUObject, SPUObject]
- y_true_mean#
Union[PYUObject, SPUObject]
- y_pred_mean#
Union[PYUObject, SPUObject]
- residual_hist#
Union[PYUObject, SPUObject]
Methods:
| __init__(y_true, y_pred[, spu_device, bins]) | |
- __init__(y_true: FedNdarray, y_pred: FedNdarray, spu_device: Optional[SPU] = None, bins=10)[source]#
- class secretflow.stats.BiClassificationEval(y_true: Union[FedNdarray, VDataFrame], y_score: Union[FedNdarray, VDataFrame], bucket_size: int)[source]#
Bases: object
Statistics Evaluation for a bi-classification model on a dataset.
- Attributes:
- y_true: Union[FedNdarray, VDataFrame]
the input labels
- y_score: Union[FedNdarray, VDataFrame]
the input prediction scores
- bucket_size: int
the number of bins used in the reports
Methods:
| __init__(y_true, y_score, bucket_size) | |
| get_all_reports() | Get all reports. |
- __init__(y_true: Union[FedNdarray, VDataFrame], y_score: Union[FedNdarray, VDataFrame], bucket_size: int)[source]#
- get_all_reports() → PYUObject [source]#
Get all reports. The reports contain:
- summary_report: SummaryReport
- group_reports: List[GroupReport]
- eq_frequent_bin_report: List[EqBinReport]
- eq_range_bin_report: List[EqBinReport]
- head_report: List[PrReport], the reports for fpr = 0.001, 0.005, 0.01, 0.05, 0.1, 0.2
See core.biclassification_eval_core for more details.
- secretflow.stats.pva_eval(actual: Union[FedNdarray, VDataFrame], prediction: Union[FedNdarray, VDataFrame], target) → PYUObject [source]#
Compute Prediction Vs Actual score.
- Parameters
actual – Union[FedNdarray, VDataFrame]
prediction – Union[FedNdarray, VDataFrame]
target – numeric, the target label in actual entries to consider
- Returns
result – PYUObject whose underlying value is a float equal to abs(mean(prediction) - sum(actual == target) / count(actual))
- secretflow.stats.table_statistics(table: Union[DataFrame, VDataFrame]) → DataFrame [source]#
Get table statistics for a pd.DataFrame or VDataFrame.
- Parameters
table – Union[pd.DataFrame, VDataFrame]
- Returns
table_statistics – pd.DataFrame including each column’s datatype, total_count, count, count_na, min, max, var, std, sem, skewness, kurtosis, q1, q2, q3, moment_2, moment_3, moment_4, central_moment_2, central_moment_3, central_moment_4, sum, sum_2, sum_3 and sum_4.
moment_2 means E[X^2].
central_moment_2 means E[(X - mean(X))^2].
sum_2 means sum(X^2).
- secretflow.stats.psi_eval(X: Union[FedNdarray, VDataFrame], Y: Union[FedNdarray, VDataFrame], split_points) → PYUObject [source]#
Calculate population stability index.
- Parameters
X – Union[FedNdarray, VDataFrame], a collection of samples
Y – Union[FedNdarray, VDataFrame], a collection of samples
split_points – array, an ordered sequence of split points
- Returns
result – float, the population stability index
- class secretflow.stats.ScoreCard(odd_base: float, scaled_value: float, pdo: float, max_score: int = 1000, min_score: int = 0, bad_label_value: int = 0)[source]#
Bases: object
The component provides a mapping procedure from binary regression’s probability value to an integer range score.
- The mapping process is as follows:
odds = pred / (1 - pred)
score = offset + factor * log(odds)
- The offset and factor in the formula come from the user’s settings. Usually users do not directly give offset and factor, but give three constraint parameters:
scaled_value: a score baseline
odd_base: the odds value at given score baseline
pdo: how many scores are needed to double odds
- The offset and factor can be solved using these three constraint parameters:
factor = pdo / log(2)
offset = scaled_value - (factor * log(odd_base))
- odd_base / scaled_value / pdo
see above
- max_score#
upper limit for the score
- min_score#
lower limit for the score
- bad_label_value#
which label represents the negative sample
Methods:
| __init__(odd_base, scaled_value, pdo[, ...]) | |
| transform(pred) | Map predicted probabilities to scores. |
- __init__(odd_base: float, scaled_value: float, pdo: float, max_score: int = 1000, min_score: int = 0, bad_label_value: int = 0)[source]#
- transform(pred: Union[FedNdarray, VDataFrame, HDataFrame]) → FedNdarray [source]#
Map predicted probabilities to scores.
- Parameters
pred – Union[FedNdarray, VDataFrame, HDataFrame], predicted probability from binary regression
- Returns
mapped scores.