secretflow.preprocessing package#

Subpackages#

Submodules#

secretflow.preprocessing.discretization module#

Classes:

KBinsDiscretizer([n_bins, strategy])

Bin continuous data into intervals.

class secretflow.preprocessing.discretization.KBinsDiscretizer(n_bins=5, strategy: str = 'quantile')[source]#

Bases: object

Bin continuous data into intervals.

This KBinsDiscretizer is almost the same as sklearn.preprocessing.KBinsDiscretizer except that the input and output are federated dataframes.

_discretizer#

the sklearn.preprocessing.KBinsDiscretizer instance used.

_n_bins#

The number of bins to produce.

_strategy#

{‘uniform’, ‘quantile’}; note that ‘kmeans’ is not supported yet.
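
Examples

The federated class delegates the actual binning to sklearn, so the effect of quantile binning can be sketched on plain (non-federated) data. The NumPy array below stands in for the HDataFrame/VDataFrame input purely for illustration:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Plain-data sketch of what the federated KBinsDiscretizer computes:
# quantile binning assigns each value to one of n_bins bins with
# roughly equal counts.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
est = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")
binned = est.fit_transform(X)
# binned holds ordinal bin indices 0, 1, 2.
```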

Methods:

__init__([n_bins, strategy])

fit(df[, aggregator, comparator, ...])

Fit the estimator.

transform(df)

Discretize the data.

fit_transform(df[, aggregator, comparator, ...])

Fit the estimator with X and then transform.

__init__(n_bins=5, strategy: str = 'quantile') None[source]#
fit(df: Union[HDataFrame, VDataFrame, MixDataFrame], aggregator: Optional[Aggregator] = None, comparator: Optional[Comparator] = None, compress_thres: int = 10000, error: float = 10000.0, max_iter: int = 200) KBinsDiscretizer[source]#

Fit the estimator.

Parameters
  • df – the X to fit.

  • aggregator – optional; shall be provided if df is a horizontally partitioned MixDataFrame.

  • comparator – optional; shall be provided if df is a horizontally partitioned MixDataFrame.

  • compress_thres – optional; the compress threshold of HomoBinning.

  • error – optional; the error of HomoBinning.

  • max_iter – optional; the max iterations of HomoBinning.

Returns

the instance itself.

transform(df: Union[HDataFrame, VDataFrame, MixDataFrame]) Union[HDataFrame, VDataFrame, MixDataFrame][source]#

Discretize the data.

Parameters

df – the X to discretize.

Returns

the transformed X as a federated dataframe.

fit_transform(df: Union[HDataFrame, VDataFrame, MixDataFrame], aggregator: Optional[Aggregator] = None, comparator: Optional[Comparator] = None, compress_thres: int = 10000, error: float = 10000.0, max_iter: int = 200)[source]#

Fit the estimator with X, then transform X. A convenient combination of the fit and transform methods.

secretflow.preprocessing.encoder module#

Classes:

LabelEncoder()

Encode target labels with value between 0 and n_classes-1.

OneHotEncoder([min_frequency, max_categories])

Encode categorical features as a one-hot numeric array.

class secretflow.preprocessing.encoder.LabelEncoder[source]#

Bases: object

Encode target labels with value between 0 and n_classes-1.

The same as sklearn.preprocessing.LabelEncoder except that the input/output is a federated dataframe.

_encoder#

the sklearn LabelEncoder instance.

Examples

>>> from secretflow.preprocessing import LabelEncoder
>>> le = LabelEncoder()
>>> le.fit(df)
>>> le.transform(df)

Methods:

fit(df)

Fit label encoder.

transform(df)

Transform labels to normalized encoding.

fit_transform(df)

Fit label encoder and return encoded labels.

fit(df: Union[HDataFrame, VDataFrame, MixDataFrame])[source]#

Fit label encoder.

transform(df: Union[HDataFrame, VDataFrame, MixDataFrame]) Union[HDataFrame, VDataFrame, MixDataFrame][source]#

Transform labels to normalized encoding.

fit_transform(df: Union[HDataFrame, VDataFrame]) Union[HDataFrame, VDataFrame, MixDataFrame][source]#

Fit label encoder and return encoded labels.

class secretflow.preprocessing.encoder.OneHotEncoder(min_frequency=None, max_categories=None)[source]#

Bases: object

Encode categorical features as a one-hot numeric array.

The same as sklearn.preprocessing.OneHotEncoder except that the input/output is a federated dataframe.

Note: min_frequency and max_categories are calculated per partition, so they are currently only available for vertical scenarios.

Parameters
  • min_frequency

    int or float, default=None. Specifies the minimum frequency below which a category will be considered infrequent.

    • If int, categories with a smaller cardinality will be considered infrequent.

    • If float, categories with a smaller cardinality than min_frequency * n_samples will be considered infrequent.

  • max_categories – int, default=None Specifies an upper limit to the number of output features for each input feature when considering infrequent categories. If there are infrequent categories, max_categories includes the category representing the infrequent categories along with the frequent categories. If None, there is no limit to the number of output features.

_encoder#

the sklearn OneHotEncoder instance.

Examples

>>> from secretflow.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> enc.fit(df)
>>> enc.transform(df)

Methods:

__init__([min_frequency, max_categories])

fit(df)

Fit this encoder with X.

transform(df)

Transform X using one-hot encoding.

fit_transform(df)

Fit this OneHotEncoder with X, then transform X.

__init__(min_frequency=None, max_categories=None)[source]#
fit(df: Union[HDataFrame, VDataFrame, MixDataFrame]) Union[HDataFrame, VDataFrame, MixDataFrame][source]#

Fit this encoder with X.

transform(df: Union[HDataFrame, VDataFrame, MixDataFrame]) Union[HDataFrame, VDataFrame, MixDataFrame][source]#

Transform X using one-hot encoding.

fit_transform(df: Union[HDataFrame, VDataFrame, MixDataFrame]) Union[HDataFrame, VDataFrame, MixDataFrame][source]#

Fit this OneHotEncoder with X, then transform X.

secretflow.preprocessing.scaler module#

Classes:

MinMaxScaler()

Transform features by scaling each feature to a given range.

StandardScaler([with_mean, with_std])

Standardize features by removing the mean and scaling to unit variance.

class secretflow.preprocessing.scaler.MinMaxScaler[source]#

Bases: object

Transform features by scaling each feature to a given range.

_scaler#

the sklearn MinMaxScaler instance.

Examples

>>> from secretflow.preprocessing import MinMaxScaler
>>> scaler = MinMaxScaler()
>>> scaler.fit(df)
>>> scaler.transform(df)

Methods:

fit(df)

Compute the minimum and maximum for later scaling.

transform(df)

Scale features of X according to feature_range.

fit_transform(df)

Fit to X, then transform X.

fit(df: Union[HDataFrame, VDataFrame, MixDataFrame])[source]#

Compute the minimum and maximum for later scaling.

transform(df: Union[HDataFrame, VDataFrame, MixDataFrame]) Union[HDataFrame, VDataFrame, MixDataFrame][source]#

Scale features of X according to feature_range.

fit_transform(df: Union[HDataFrame, VDataFrame])[source]#

Fit to X, then transform X.

class secretflow.preprocessing.scaler.StandardScaler(with_mean=True, with_std=True)[source]#

Bases: object

Standardize features by removing the mean and scaling to unit variance.

StandardScaler is similar to sklearn.preprocessing.StandardScaler. The main differences are that it (a) takes HDataFrame/VDataFrame/MixDataFrame as input/output and (b) does not support sparse matrices.

The standard score of a sample x is calculated as:

z = (x - u) / s

where u is the mean of the training samples or zero if with_mean=False, and s is the standard deviation of the training samples or one if with_std=False.
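
As a quick numeric check of the formula, in plain NumPy and independent of the federated machinery:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
u, s = x.mean(), x.std()  # u = 2.5, s = sqrt(1.25)
z = (x - u) / s           # standard scores
# After scaling, z has zero mean and unit standard deviation.
```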

_scaler#

the sklearn StandardScaler instance.

_with_mean#

bool, default=True. If True, center the data before scaling.

_with_std#

bool, default=True. If True, scale the data to unit variance (or equivalently, unit standard deviation).

Examples

>>> from secretflow.preprocessing import StandardScaler
>>> data = HDataFrame(...) # your HDataFrame/VDataFrame/MixDataFrame instance.
>>> scaler = StandardScaler()
>>> scaler.fit(data)
>>> print(scaler._scaler.mean_, scaler._scaler.var_)
>>> scaler.transform(data)

Methods:

__init__([with_mean, with_std])

param with_mean

optional; same as sklearn StandardScaler.

fit(df[, aggregator])

Fit a federated dataframe.

transform(df)

Transform a federated dataframe.

fit_transform(df[, aggregator])

A convenient combination of fit and transform.

__init__(with_mean=True, with_std=True) None[source]#
Parameters
  • with_mean – optional; same as sklearn StandardScaler.

  • with_std – optional; same as sklearn StandardScaler.

fit(df: Union[HDataFrame, VDataFrame, MixDataFrame], aggregator: Optional[Aggregator] = None)[source]#

Fit a federated dataframe.

Parameters
  • df – the X to fit.

  • aggregator – optional; the aggregator used to compute the global mean and standard deviation. Shall be provided if df is a horizontally partitioned MixDataFrame.

transform(df: Union[HDataFrame, VDataFrame, MixDataFrame]) Union[HDataFrame, VDataFrame, MixDataFrame][source]#

Transform a federated dataframe.

Parameters

df – the X to transform.

Returns

a federated dataframe corresponding to the input X.

fit_transform(df: Union[HDataFrame, VDataFrame], aggregator: Optional[Aggregator] = None)[source]#

A convenient combination of fit and transform.

secretflow.preprocessing.transformer module#

Classes:

LogroundTransformer([decimals, bias])

Constructs a transformer that calculates round(log2(x + bias)) on (each partition of) a dataframe.

class secretflow.preprocessing.transformer.LogroundTransformer(decimals: int = 6, bias: float = 0.5)[source]#

Bases: _FunctionTransformer

Constructs a transformer that calculates round(log2(x + bias)) on (each partition of) a dataframe.

Parameters
  • decimals – Number of decimal places to round each column to. Defaults to 6.

  • bias – Add bias to value before log2. Defaults to 0.5.
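
The computation itself is simple. Below is a plain-pandas sketch of the same transform; the `loground` helper is illustrative only and not part of the secretflow API:

```python
import numpy as np
import pandas as pd

def loground(df: pd.DataFrame, decimals: int = 6, bias: float = 0.5) -> pd.DataFrame:
    # Illustrative stand-in for LogroundTransformer on a plain dataframe:
    # add the bias, take log2, then round to `decimals` places.
    return np.log2(df + bias).round(decimals)

df = pd.DataFrame({"x": [0.5, 1.5, 3.5]})
out = loground(df)  # log2(1) = 0, log2(2) = 1, log2(4) = 2
```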

Methods:

__init__([decimals, bias])

__init__(decimals: int = 6, bias: float = 0.5)[source]#

Module contents#

Classes:

KBinsDiscretizer([n_bins, strategy])

Bin continuous data into intervals.

LabelEncoder()

Encode target labels with value between 0 and n_classes-1.

OneHotEncoder([min_frequency, max_categories])

Encode categorical features as a one-hot numeric array.

MinMaxScaler()

Transform features by scaling each feature to a given range.

StandardScaler([with_mean, with_std])

Standardize features by removing the mean and scaling to unit variance.

LogroundTransformer([decimals, bias])

Constructs a transformer that calculates round(log2(x + bias)) on (each partition of) a dataframe.

class secretflow.preprocessing.KBinsDiscretizer(n_bins=5, strategy: str = 'quantile')[source]#

Bases: object

Bin continuous data into intervals.

This KBinsDiscretizer is almost the same as sklearn.preprocessing.KBinsDiscretizer except that the input and output are federated dataframes.

_discretizer#

the sklearn.preprocessing.KBinsDiscretizer instance used.

_n_bins#

The number of bins to produce.

_strategy#

{‘uniform’, ‘quantile’}; note that ‘kmeans’ is not supported yet.

Methods:

__init__([n_bins, strategy])

fit(df[, aggregator, comparator, ...])

Fit the estimator.

transform(df)

Discretize the data.

fit_transform(df[, aggregator, comparator, ...])

Fit the estimator with X and then transform.

__init__(n_bins=5, strategy: str = 'quantile') None[source]#
fit(df: Union[HDataFrame, VDataFrame, MixDataFrame], aggregator: Optional[Aggregator] = None, comparator: Optional[Comparator] = None, compress_thres: int = 10000, error: float = 10000.0, max_iter: int = 200) KBinsDiscretizer[source]#

Fit the estimator.

Parameters
  • df – the X to fit.

  • aggregator – optional; shall be provided if df is a horizontally partitioned MixDataFrame.

  • comparator – optional; shall be provided if df is a horizontally partitioned MixDataFrame.

  • compress_thres – optional; the compress threshold of HomoBinning.

  • error – optional; the error of HomoBinning.

  • max_iter – optional; the max iterations of HomoBinning.

Returns

the instance itself.

transform(df: Union[HDataFrame, VDataFrame, MixDataFrame]) Union[HDataFrame, VDataFrame, MixDataFrame][source]#

Discretize the data.

Parameters

df – the X to discretize.

Returns

the transformed X as a federated dataframe.

fit_transform(df: Union[HDataFrame, VDataFrame, MixDataFrame], aggregator: Optional[Aggregator] = None, comparator: Optional[Comparator] = None, compress_thres: int = 10000, error: float = 10000.0, max_iter: int = 200)[source]#

Fit the estimator with X, then transform X. A convenient combination of the fit and transform methods.

class secretflow.preprocessing.LabelEncoder[source]#

Bases: object

Encode target labels with value between 0 and n_classes-1.

The same as sklearn.preprocessing.LabelEncoder except that the input/output is a federated dataframe.

_encoder#

the sklearn LabelEncoder instance.

Examples

>>> from secretflow.preprocessing import LabelEncoder
>>> le = LabelEncoder()
>>> le.fit(df)
>>> le.transform(df)

Methods:

fit(df)

Fit label encoder.

transform(df)

Transform labels to normalized encoding.

fit_transform(df)

Fit label encoder and return encoded labels.

fit(df: Union[HDataFrame, VDataFrame, MixDataFrame])[source]#

Fit label encoder.

transform(df: Union[HDataFrame, VDataFrame, MixDataFrame]) Union[HDataFrame, VDataFrame, MixDataFrame][source]#

Transform labels to normalized encoding.

fit_transform(df: Union[HDataFrame, VDataFrame]) Union[HDataFrame, VDataFrame, MixDataFrame][source]#

Fit label encoder and return encoded labels.

class secretflow.preprocessing.OneHotEncoder(min_frequency=None, max_categories=None)[source]#

Bases: object

Encode categorical features as a one-hot numeric array.

The same as sklearn.preprocessing.OneHotEncoder except that the input/output is a federated dataframe.

Note: min_frequency and max_categories are calculated per partition, so they are currently only available for vertical scenarios.

Parameters
  • min_frequency

    int or float, default=None. Specifies the minimum frequency below which a category will be considered infrequent.

    • If int, categories with a smaller cardinality will be considered infrequent.

    • If float, categories with a smaller cardinality than min_frequency * n_samples will be considered infrequent.

  • max_categories – int, default=None Specifies an upper limit to the number of output features for each input feature when considering infrequent categories. If there are infrequent categories, max_categories includes the category representing the infrequent categories along with the frequent categories. If None, there is no limit to the number of output features.

_encoder#

the sklearn OneHotEncoder instance.

Examples

>>> from secretflow.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> enc.fit(df)
>>> enc.transform(df)

Methods:

__init__([min_frequency, max_categories])

fit(df)

Fit this encoder with X.

transform(df)

Transform X using one-hot encoding.

fit_transform(df)

Fit this OneHotEncoder with X, then transform X.

__init__(min_frequency=None, max_categories=None)[source]#
fit(df: Union[HDataFrame, VDataFrame, MixDataFrame]) Union[HDataFrame, VDataFrame, MixDataFrame][source]#

Fit this encoder with X.

transform(df: Union[HDataFrame, VDataFrame, MixDataFrame]) Union[HDataFrame, VDataFrame, MixDataFrame][source]#

Transform X using one-hot encoding.

fit_transform(df: Union[HDataFrame, VDataFrame, MixDataFrame]) Union[HDataFrame, VDataFrame, MixDataFrame][source]#

Fit this OneHotEncoder with X, then transform X.

class secretflow.preprocessing.MinMaxScaler[source]#

Bases: object

Transform features by scaling each feature to a given range.

_scaler#

the sklearn MinMaxScaler instance.

Examples

>>> from secretflow.preprocessing import MinMaxScaler
>>> scaler = MinMaxScaler()
>>> scaler.fit(df)
>>> scaler.transform(df)

Methods:

fit(df)

Compute the minimum and maximum for later scaling.

transform(df)

Scale features of X according to feature_range.

fit_transform(df)

Fit to X, then transform X.

fit(df: Union[HDataFrame, VDataFrame, MixDataFrame])[source]#

Compute the minimum and maximum for later scaling.

transform(df: Union[HDataFrame, VDataFrame, MixDataFrame]) Union[HDataFrame, VDataFrame, MixDataFrame][source]#

Scale features of X according to feature_range.

fit_transform(df: Union[HDataFrame, VDataFrame])[source]#

Fit to X, then transform X.

class secretflow.preprocessing.StandardScaler(with_mean=True, with_std=True)[source]#

Bases: object

Standardize features by removing the mean and scaling to unit variance.

StandardScaler is similar to sklearn.preprocessing.StandardScaler. The main differences are that it (a) takes HDataFrame/VDataFrame/MixDataFrame as input/output and (b) does not support sparse matrices.

The standard score of a sample x is calculated as:

z = (x - u) / s

where u is the mean of the training samples or zero if with_mean=False, and s is the standard deviation of the training samples or one if with_std=False.

_scaler#

the sklearn StandardScaler instance.

_with_mean#

bool, default=True. If True, center the data before scaling.

_with_std#

bool, default=True. If True, scale the data to unit variance (or equivalently, unit standard deviation).

Examples

>>> from secretflow.preprocessing import StandardScaler
>>> data = HDataFrame(...) # your HDataFrame/VDataFrame/MixDataFrame instance.
>>> scaler = StandardScaler()
>>> scaler.fit(data)
>>> print(scaler._scaler.mean_, scaler._scaler.var_)
>>> scaler.transform(data)

Methods:

__init__([with_mean, with_std])

param with_mean

optional; same as sklearn StandardScaler.

fit(df[, aggregator])

Fit a federated dataframe.

transform(df)

Transform a federated dataframe.

fit_transform(df[, aggregator])

A convenient combination of fit and transform.

__init__(with_mean=True, with_std=True) None[source]#
Parameters
  • with_mean – optional; same as sklearn StandardScaler.

  • with_std – optional; same as sklearn StandardScaler.

fit(df: Union[HDataFrame, VDataFrame, MixDataFrame], aggregator: Optional[Aggregator] = None)[source]#

Fit a federated dataframe.

Parameters
  • df – the X to fit.

  • aggregator – optional; the aggregator used to compute the global mean and standard deviation. Shall be provided if df is a horizontally partitioned MixDataFrame.

transform(df: Union[HDataFrame, VDataFrame, MixDataFrame]) Union[HDataFrame, VDataFrame, MixDataFrame][source]#

Transform a federated dataframe.

Parameters

df – the X to transform.

Returns

a federated dataframe corresponding to the input X.

fit_transform(df: Union[HDataFrame, VDataFrame], aggregator: Optional[Aggregator] = None)[source]#

A convenient combination of fit and transform.

class secretflow.preprocessing.LogroundTransformer(decimals: int = 6, bias: float = 0.5)[source]#

Bases: _FunctionTransformer

Constructs a transformer that calculates round(log2(x + bias)) on (each partition of) a dataframe.

Parameters
  • decimals – Number of decimal places to round each column to. Defaults to 6.

  • bias – Add bias to value before log2. Defaults to 0.5.

Methods:

__init__([decimals, bias])

__init__(decimals: int = 6, bias: float = 0.5)[source]#