secretflow.stats.core package#

Submodules#

secretflow.stats.core.biclassification_eval_core module#

Classes:

Report(eq_frequent_result_arr_list, ...)

Report containing all other reports for bi-classification evaluation

PrReport(arr)

Precision Related statistics Report.

SummaryReport(arr)

Summary Report for bi-classification evaluation.

GroupReport()

Report for each group

EqBinReport(arr)

Statistics Report for each bin.

Functions:

gen_all_reports(y_true, y_score, bin_size)

Generate all reports.

create_sorted_label_score_pair(y_true, y_score)

Produce an n * 2 shaped array whose second column contains the scores sorted in decreasing order.

eq_frequent_bin_evaluate(sorted_pairs, ...)

Fill eq frequent bin report.

eq_range_bin_evaluate(sorted_pairs, ...)

Fill eq range bin report.

evaluate_bins(sorted_pairs, pos_count, ...)

Evaluate bins given sorted pairs, pos_count and split_points (in decreasing order).

bin_evaluate(sorted_pairs, start_pos, ...)

Evaluate statistics for a bin.

gen_pr_reports(sorted_pairs, thresholds)

Generate pr report per specified threshold.

precision_recall_false_positive_rate(...)

confusion_matrix_from_cum_counts(...)

Compute the confusion matrix.

binary_clf_curve(sorted_pairs)

Calculate true and false positives per binary classification threshold (can be used for roc curve or precision/recall curve).

roc_curve(sorted_pairs)

Compute Receiver operating characteristic (ROC).

auc(x, y)

Compute Area Under the Curve (AUC) using the trapezoidal rule.

binary_roc_auc(sorted_pairs)

Compute Area Under the Curve (AUC) for ROC from labels and prediction scores in sorted_pairs.

compute_f1_score(true_positive, ...)

Calculate the F1 score.

class secretflow.stats.core.biclassification_eval_core.Report(eq_frequent_result_arr_list, eq_range_result_arr_list, summary_report_arr, head_prs)[source]#

Bases: object

Report containing all other reports for bi-classification evaluation

summary_report#

SummaryReport

group_reports#

List[GroupReport]

eq_frequent_bin_report#

List[EqBinReport]

eq_range_bin_report#

List[EqBinReport]

head_report#

List[PrReport] reports for fpr = 0.001, 0.005, 0.01, 0.05, 0.1, 0.2

Methods:

__init__(eq_frequent_result_arr_list, ...)

__init__(eq_frequent_result_arr_list, eq_range_result_arr_list, summary_report_arr, head_prs)[source]#
class secretflow.stats.core.biclassification_eval_core.PrReport(arr)[source]#

Bases: object

Precision Related statistics Report.

fpr#

float FP/(FP+TN)

precision#

float TP/(TP+FP)

recall#

float TP/(TP+FN)

Methods:

__init__(arr)

__init__(arr)[source]#
class secretflow.stats.core.biclassification_eval_core.SummaryReport(arr)[source]#

Bases: object

Summary Report for bi-classification evaluation.

total_samples#

int

positive_samples#

int

negative_samples#

int

auc#

float auc: area under the curve: https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc

ks#

float Kolmogorov-Smirnov statistic: https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test

f1_score#

float harmonic mean of precision and recall: https://en.wikipedia.org/wiki/F-score

Methods:

__init__(arr)

__init__(arr)[source]#
class secretflow.stats.core.biclassification_eval_core.GroupReport[source]#

Bases: object

Report for each group

Attributes:

group_name

summary

group_name: str#
summary: SummaryReport#
class secretflow.stats.core.biclassification_eval_core.EqBinReport(arr)[source]#

Bases: object

Statistics Report for each bin.

start_value#

float

end_value#

float

positive#

int

negative#

int

total#

int

precision#

float

recall#

float

false_positive_rate#

float

f1_score#

float

lift#

float see https://en.wikipedia.org/wiki/Lift_(data_mining)

predicted_positive_ratio#

float predicted positive samples / total samples.

predicted_negative_ratio#

float predicted negative samples / total samples.

cumulative_percent_of_positive#

float

cumulative_percent_of_negative#

float

total_cumulative_percent#

float

ks#

float

avg_score#

float

Methods:

__init__(arr)

__init__(arr)[source]#
secretflow.stats.core.biclassification_eval_core.gen_all_reports(y_true: Union[DataFrame, array], y_score: Union[DataFrame, array], bin_size: int)[source]#

Generate all reports.

Parameters
  • y_true – Union[pd.DataFrame, jnp.array] should be of shape n * 1, with binary entries; 1 means a positive sample

  • y_score – Union[pd.DataFrame, jnp.array] should be of shape n * 1, with each entry in [0, 1]: the probability of being positive

  • bin_size – int number of bins to evaluate

Returns:
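
Example (a minimal usage sketch; the toy data and the assumption that the function returns the Report object documented above are ours, not stated by the library docs):

>>> import pandas as pd
>>> from secretflow.stats.core.biclassification_eval_core import gen_all_reports
>>> y_true = pd.DataFrame([1, 0, 1, 0, 1, 0, 1, 0])                    # binary labels, shape n * 1
>>> y_score = pd.DataFrame([0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6, 0.1])   # scores in [0, 1]
>>> report = gen_all_reports(y_true, y_score, bin_size=4)
>>> auc_value = report.summary_report.auc    # documented SummaryReport attribute; value depends on the data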

secretflow.stats.core.biclassification_eval_core.create_sorted_label_score_pair(y_true: array, y_score: array)[source]#

Produce an n * 2 shaped array whose second column contains the scores sorted in decreasing order.
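
The behavior described above can be sketched in jax.numpy as follows (an illustrative reconstruction based on the one-line description, not the library's source; the toy data is ours):

>>> import jax.numpy as jnp
>>> y_true = jnp.array([1, 0, 1, 0])
>>> y_score = jnp.array([0.2, 0.9, 0.8, 0.1])
>>> order = jnp.argsort(-y_score)                              # indices that sort scores in decreasing order
>>> sorted_pairs = jnp.stack([y_true[order], y_score[order]], axis=1)
>>> sorted_pairs.shape
(4, 2)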

secretflow.stats.core.biclassification_eval_core.eq_frequent_bin_evaluate(sorted_pairs: array, pos_count: int, bin_size: int) List[array][source]#

Fill eq frequent bin report.

Parameters
  • sorted_pairs – jnp.array Should be of shape n * 2, with the second column sorted

  • pos_count – int Total number of positive samples

  • bin_size – int Total number of bins

Returns

List[jnp.array]

Return type

bin_reports

secretflow.stats.core.biclassification_eval_core.eq_range_bin_evaluate(sorted_pairs: array, pos_count: int, bin_size: int) List[array][source]#

Fill eq range bin report.

Parameters
  • sorted_pairs – jnp.array Should be of shape n * 2, with the second column sorted.

  • pos_count – int Total number of positive samples

  • bin_size – int Total number of bins

Returns

List[jnp.array]

Return type

bin_reports

secretflow.stats.core.biclassification_eval_core.evaluate_bins(sorted_pairs: array, pos_count: int, split_points) List[array][source]#

Evaluate bins given sorted pairs, pos_count and split_points (in decreasing order).

secretflow.stats.core.biclassification_eval_core.bin_evaluate(sorted_pairs, start_pos, end_pos, total_pos_count, total_neg_count, cumulative_pos_count, cumulative_neg_count) Tuple[array, int, int][source]#

Evaluate statistics for a bin.

Returns

jnp.array

an array of size BIN_REPORT_STATISTICS_ENTRY_COUNT

cumulative_pos_count: int

cumulative_neg_count: int

Return type

bin_report_arr

secretflow.stats.core.biclassification_eval_core.gen_pr_reports(sorted_pairs: array, thresholds: array) List[array][source]#

Generate pr report per specified threshold.

Parameters
  • sorted_pairs – jnp.array of (y_true, y_score) pairs sorted by y_score in increasing order, of shape n_samples * 2

  • thresholds – 1d jnp.ndarray prediction thresholds on which to evaluate

Returns

List[jnp.array]

a list of pr reports, each a jnp.array of shape 3 * 1; list length = len(thresholds)

Return type

pr_report_arr

secretflow.stats.core.biclassification_eval_core.precision_recall_false_positive_rate(true_positive, false_positive, false_negative, true_negative) Tuple[float, float, float][source]#
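
The three quantities follow the formulas listed under PrReport above; the hypothetical counts below are ours and the library's return ordering is not asserted:

>>> tp, fp, fn, tn = 40, 10, 5, 45
>>> precision = tp / (tp + fp)          # TP/(TP+FP)
>>> recall = tp / (tp + fn)             # TP/(TP+FN)
>>> fpr = fp / (fp + tn)                # FP/(FP+TN)
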
secretflow.stats.core.biclassification_eval_core.confusion_matrix_from_cum_counts(cumulative_pos_count, cumulative_neg_count, total_neg_count, total_pos_count)[source]#

Compute the confusion matrix.

Parameters
  • cumulative_pos_count – int

  • cumulative_neg_count – int

  • total_neg_count – int

  • total_pos_count – int

Returns

int

true_negative: int

false_positive: int

false_negative: int

Return type

true_positive

secretflow.stats.core.biclassification_eval_core.binary_clf_curve(sorted_pairs: array) Tuple[array, array, array][source]#

Calculate true and false positives per binary classification threshold (can be used for roc curve or precision/recall curve).

Parameters

sorted_pairs – jnp.array of (y_true, y_score) pairs sorted by y_score in decreasing order

Returns

1d ndarray

False positive counts: index i records the number of negative samples assigned a score >= thresholds[i]. The total number of negative samples equals fps[-1] (thus true negatives are given by fps[-1] - fps).

tps: 1d ndarray

True positive counts: index i records the number of positive samples assigned a score >= thresholds[i]. The total number of positive samples equals tps[-1] (thus false negatives are given by tps[-1] - tps).

thresholds: 1d ndarray

Distinct predicted scores sorted in decreasing order

Return type

fps

References

Github: scikit-learn _binary_clf_curve.
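
Example (a minimal sketch with hand-built, score-sorted pairs; the toy data is ours, and the (fps, tps, thresholds) return order follows the documentation above):

>>> import jax.numpy as jnp
>>> from secretflow.stats.core.biclassification_eval_core import binary_clf_curve
>>> sorted_pairs = jnp.array([[1, 0.9], [0, 0.8], [1, 0.6], [0, 0.3]])  # (y_true, y_score), decreasing score
>>> fps, tps, thresholds = binary_clf_curve(sorted_pairs)
>>> # per the docs, fps[-1] is the total negative count and tps[-1] the total positive count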

secretflow.stats.core.biclassification_eval_core.roc_curve(sorted_pairs: array) Tuple[array, array, array][source]#

Compute Receiver operating characteristic (ROC).

Compared to the sklearn implementation, this implementation eliminates most conditionals and ill-condition checks.

Parameters

sorted_pairs – jnp.array of (y_true, y_score) pairs sorted by y_score in decreasing order

Returns

ndarray of shape (>2,)

Increasing false positive rates such that element i is the false positive rate of predictions with score >= thresholds[i].

tpr: ndarray of shape (>2,)

Increasing true positive rates such that element i is the true positive rate of predictions with score >= thresholds[i].

thresholds: ndarray of shape = (n_thresholds,)

Decreasing thresholds on the decision function used to compute fpr and tpr. thresholds[0] represents no instances being predicted and is arbitrarily set to max(y_score) + 1.

Return type

fpr

References

Github: scikit-learn roc_curve.
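
Example (a minimal sketch with illustrative data; the (fpr, tpr, thresholds) return order follows the documentation above):

>>> import jax.numpy as jnp
>>> from secretflow.stats.core.biclassification_eval_core import roc_curve
>>> sorted_pairs = jnp.array([[1, 0.9], [1, 0.7], [0, 0.4], [0, 0.2]])  # (y_true, y_score), decreasing score
>>> fpr, tpr, thresholds = roc_curve(sorted_pairs)
>>> # fpr and tpr are increasing; thresholds are decreasing, with thresholds[0] = max(y_score) + 1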

secretflow.stats.core.biclassification_eval_core.auc(x, y)[source]#

Compute Area Under the Curve (AUC) using the trapezoidal rule.

Parameters
  • x – ndarray of shape (n,), monotonic x coordinates

  • y – ndarray of shape (n,), y coordinates

Returns

float

Area Under the Curve

Return type

auc
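
Example (a minimal sketch with illustrative coordinates; the hand computation alongside is the standard trapezoidal rule the docstring names):

>>> import jax.numpy as jnp
>>> from secretflow.stats.core.biclassification_eval_core import auc
>>> x = jnp.array([0.0, 0.5, 1.0])        # monotonic x coordinates
>>> y = jnp.array([0.0, 0.75, 1.0])
>>> area = auc(x, y)
>>> by_hand = 0.5 * jnp.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]))   # trapezoidal rule, for comparison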

secretflow.stats.core.biclassification_eval_core.binary_roc_auc(sorted_pairs: array) float[source]#

Compute Area Under the Curve (AUC) for ROC from labels and prediction scores in sorted_pairs.

Compared to the sklearn implementation, this implementation is pared down, with fewer options, and eliminates most conditionals and ill-condition checks.

Parameters

sorted_pairs – jnp.array of (y_true, y_score) pairs sorted by y_score in decreasing order, of shape n_samples * 2.

Returns

float

Return type

roc_auc

References

Github: scikit-learn _binary_roc_auc_score.
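
Example (a minimal sketch with illustrative data; by the standard ROC/AUC definition, perfectly ranked scores like these are expected to yield an AUC of 1.0):

>>> import jax.numpy as jnp
>>> from secretflow.stats.core.biclassification_eval_core import binary_roc_auc
>>> sorted_pairs = jnp.array([[1, 0.9], [1, 0.8], [0, 0.3], [0, 0.1]])  # positives ranked above negatives
>>> roc_auc = binary_roc_auc(sorted_pairs)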

secretflow.stats.core.biclassification_eval_core.compute_f1_score(true_positive: int, false_positive: int, false_negative: int) float[source]#

Calculate the F1 score.
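
Example (a minimal sketch with hypothetical counts; the hand computation uses the standard F1 identity 2*TP / (2*TP + FP + FN), which is equivalent to the harmonic mean of precision and recall):

>>> from secretflow.stats.core.biclassification_eval_core import compute_f1_score
>>> tp, fp, fn = 40, 10, 5
>>> f1 = compute_f1_score(tp, fp, fn)
>>> by_hand = 2 * tp / (2 * tp + fp + fn)   # should agree with f1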

secretflow.stats.core.psi_core module#

Functions:

psi_index(a, b)

(a - b) * ln(a/b).

psi_score(A, B)

Computes the psi score.

distribution_generation(X, split_points)

Generate a distribution of X according to split points.

psi(X, Y, split_points)

Calculate population stability index.

secretflow.stats.core.psi_core.psi_index(a, b)[source]#

(a - b) * ln(a/b).

Parameters
  • a – array or float

  • b – array or float. a and b must be of the same type; each can be a float, jnp.array, or np.array.

Returns

array or float

same type as a or b.

Return type

result
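
Example (a minimal sketch of the documented formula with illustrative scalars):

>>> import jax.numpy as jnp
>>> from secretflow.stats.core.psi_core import psi_index
>>> a, b = 0.3, 0.25
>>> idx = psi_index(a, b)
>>> by_hand = (a - b) * jnp.log(a / b)      # (a - b) * ln(a/b), for comparison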

secretflow.stats.core.psi_core.psi_score(A: array, B: array)[source]#

Computes the psi score.

Parameters
  • A – jnp.array Distribution of sample A

  • B – jnp.array Distribution of sample B

Returns

float

Return type

result
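
Example (a minimal sketch with illustrative distributions; the hand computation assumes the score is the sum of psi_index over bins, i.e. the standard PSI definition):

>>> import jax.numpy as jnp
>>> from secretflow.stats.core.psi_core import psi_score
>>> A = jnp.array([0.3, 0.5, 0.2])          # binned distribution of sample A
>>> B = jnp.array([0.25, 0.5, 0.25])        # binned distribution of sample B
>>> score = psi_score(A, B)
>>> by_hand = jnp.sum((A - B) * jnp.log(A / B))   # assumed-equivalent hand computation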

secretflow.stats.core.psi_core.distribution_generation(X: array, split_points: array)[source]#

Generate a distribution of X according to split points.

Parameters
  • X – jnp.array a collection of samples

  • split_points – jnp.array an ordered sequence of split points

Returns

jnp.array

distribution in the form of the percentage of counts in each bin; bin[0] covers [split_points[0], split_points[1]).

Return type

dist_X
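
Example (a minimal sketch with illustrative data; the handling of the last bin's upper edge is not asserted here):

>>> import jax.numpy as jnp
>>> from secretflow.stats.core.psi_core import distribution_generation
>>> X = jnp.array([0.1, 0.2, 0.35, 0.7, 0.9])
>>> split_points = jnp.array([0.0, 0.5, 1.0])            # per the docs, bin[0] = [0.0, 0.5)
>>> dist_X = distribution_generation(X, split_points)    # share of samples per bin, about 60% / 40% here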

secretflow.stats.core.psi_core.psi(X: Union[DataFrame, array], Y: Union[DataFrame, array], split_points: array)[source]#

Calculate population stability index.

Parameters
  • X – Union[pd.DataFrame, jnp.array] a collection of samples

  • Y – Union[pd.DataFrame, jnp.array] a collection of samples

  • split_points – jnp.array an ordered sequence of split points

Returns

float

population stability index

Return type

result
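
Example (a minimal sketch with illustrative samples; under the standard PSI reading, a larger score indicates a greater shift between the two distributions):

>>> import jax.numpy as jnp
>>> from secretflow.stats.core.psi_core import psi
>>> X = jnp.array([0.1, 0.2, 0.35, 0.7, 0.9])
>>> Y = jnp.array([0.15, 0.3, 0.4, 0.6, 0.8])
>>> split_points = jnp.array([0.0, 0.5, 1.0])
>>> score = psi(X, Y, split_points)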

secretflow.stats.core.pva_core module#

Functions:

pva(actual, prediction, target)

Compute Prediction Vs Actual score.

secretflow.stats.core.pva_core.pva(actual: Union[DataFrame, array], prediction: Union[DataFrame, array], target)[source]#

Compute Prediction Vs Actual score.

Parameters
  • actual – Union[pd.DataFrame, jnp.array]

  • prediction – Union[pd.DataFrame, jnp.array]

  • target – numeric. The target label in actual entries to consider.

Returns

float

abs(mean(prediction) - sum(actual == target)/count(actual))

Return type

result
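
Example (a minimal sketch with illustrative data; the hand computation mirrors the documented return formula):

>>> import jax.numpy as jnp
>>> from secretflow.stats.core.pva_core import pva
>>> actual = jnp.array([1, 0, 1, 1, 0])
>>> prediction = jnp.array([0.7, 0.2, 0.8, 0.6, 0.3])
>>> score = pva(actual, prediction, 1)
>>> by_hand = jnp.abs(prediction.mean() - (actual == 1).sum() / actual.shape[0])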

secretflow.stats.core.utils module#

Functions:

newton_matrix_inverse(x[, iter_round])

Compute the inverse of a matrix by Newton iteration.

equal_obs(x, n_bin)

Equal frequency split point search in x with n_bin bins; each bin contains an equal number of points.

equal_range(x, n_bin)

Equal range split point search in x with n_bin bins; returns a jnp.array of size n_bin+1.

secretflow.stats.core.utils.newton_matrix_inverse(x: ndarray, iter_round: int = 20)[source]#

Compute the inverse of a matrix by Newton iteration. See https://aalexan3.math.ncsu.edu/articles/mat-inv-rep.pdf.
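
An illustrative reconstruction of the iteration (a sketch of the standard Newton-Schulz scheme from the linked note, not the library's source; the matrix, initial guess, and iteration count are ours):

>>> import jax.numpy as jnp
>>> A = jnp.array([[4.0, 1.0], [2.0, 3.0]])
>>> X = A.T / (jnp.linalg.norm(A, 1) * jnp.linalg.norm(A, jnp.inf))   # common convergent starting point
>>> I = jnp.eye(2)
>>> for _ in range(20):
...     X = X @ (2 * I - A @ X)        # Newton iteration: X_{k+1} = X_k (2I - A X_k)
>>> # X now approximates the inverse of A (X @ A is close to I)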

secretflow.stats.core.utils.equal_obs(x, n_bin)[source]#

Equal frequency split point search in x with n_bin bins; each bin contains an equal number of points.

Parameters
  • x – array

  • n_bin – int

Returns

jnp.array with size n_bin+1

secretflow.stats.core.utils.equal_range(x, n_bin)[source]#

Equal range split point search in x with n_bin bins.

Returns

jnp.array with size n_bin+1
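
Example (a minimal sketch with illustrative data; per the docs, both functions return a jnp.array of size n_bin+1):

>>> import jax.numpy as jnp
>>> from secretflow.stats.core.utils import equal_obs, equal_range
>>> x = jnp.array([0.05, 0.2, 0.3, 0.45, 0.6, 0.7, 0.85, 0.95])
>>> freq_splits = equal_obs(x, 4)       # split points giving an equal number of points per bin
>>> range_splits = equal_range(x, 4)    # split points giving equal-width bins over the range of x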

Module contents#