secretflow.preprocessing package#

Subpackages#

Submodules#

secretflow.preprocessing.discretization module#

Classes:

KBinsDiscretizer([n_bins, strategy])

Bin continuous data into intervals.

class secretflow.preprocessing.discretization.KBinsDiscretizer(n_bins=5, strategy: str = 'quantile')[source]#

Bases: object

Bin continuous data into intervals.

This KBinsDiscretizer is almost the same as sklearn.preprocessing.KBinsDiscretizer except that the input and output are federated dataframes.

_discretizer#

the sklearn.preprocessing.KBinsDiscretizer instance used.

_n_bins#

The number of bins to produce.

_strategy#

{‘uniform’, ‘quantile’}; note that ‘kmeans’ is not supported yet.
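
Examples

The federated class delegates the actual binning to sklearn, so the effect of quantile binning can be sketched on plain (non-federated) data. The NumPy array below stands in for the HDataFrame/VDataFrame input purely for illustration:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Plain-data sketch of what the federated KBinsDiscretizer computes:
# quantile binning assigns each value to one of n_bins bins with
# roughly equal counts.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
est = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")
binned = est.fit_transform(X)
# binned holds ordinal bin indices 0, 1, 2.
```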

Methods:

__init__([n_bins, strategy])

fit(df[, aggregator, comparator, ...])

Fit the estimator.

transform(df)

Discretize the data.

fit_transform(df[, aggregator, comparator, ...])

Fit the estimator with X and then transform.

__init__(n_bins=5, strategy: str = 'quantile') None[source]#
fit(df: Union[HDataFrame, VDataFrame, MixDataFrame], aggregator: Optional[Aggregator] = None, comparator: Optional[Comparator] = None, compress_thres: int = 10000, error: float = 10000.0, max_iter: int = 200) KBinsDiscretizer[source]#

Fit the estimator.

Parameters
  • df – the X to fit.

  • aggregator – optional; shall be provided if df is a horizontally partitioned MixDataFrame.

  • comparator – optional; shall be provided if df is a horizontally partitioned MixDataFrame.

  • compress_thres – optional; the compress threshold of HomoBinning.

  • error – optional; the error of HomoBinning.

  • max_iter – optional; the max iterations of HomoBinning.

Returns

the instance itself.

transform(df: Union[HDataFrame, VDataFrame, MixDataFrame]) Union[HDataFrame, VDataFrame, MixDataFrame][source]#

Discretize the data.

Parameters

df – the X to discretize.

Returns

the transformed X as a federated dataframe.

fit_transform(df: Union[HDataFrame, VDataFrame, MixDataFrame], aggregator: Optional[Aggregator] = None, comparator: Optional[Comparator] = None, compress_thres: int = 10000, error: float = 10000.0, max_iter: int = 200)[source]#

Fit the estimator with X, then transform X. A convenient combination of the fit and transform methods.

secretflow.preprocessing.encoder module#

Classes:

LabelEncoder()

Encode target labels with value between 0 and n_classes-1.

OneHotEncoder([min_frequency, max_categories])

Encode categorical features as a one-hot numeric array.

class secretflow.preprocessing.encoder.LabelEncoder[source]#

Bases: object

Encode target labels with value between 0 and n_classes-1.

The same as sklearn.preprocessing.LabelEncoder except that the input/output is a federated dataframe.

_encoder#

the sklearn LabelEncoder instance.

Examples

>>> from secretflow.preprocessing import LabelEncoder
>>> le = LabelEncoder()
>>> le.fit(df)
>>> le.transform(df)

Methods:

fit(df)

Fit label encoder.

transform(df)

Transform labels to normalized encoding.

fit_transform(df)

Fit label encoder and return encoded labels.

fit(df: Union[HDataFrame, VDataFrame, MixDataFrame])[source]#

Fit label encoder.

transform(df: Union[HDataFrame, VDataFrame, MixDataFrame]) Union[HDataFrame, VDataFrame, MixDataFrame][source]#

Transform labels to normalized encoding.

fit_transform(df: Union[HDataFrame, VDataFrame]) Union[HDataFrame, VDataFrame, MixDataFrame][source]#

Fit label encoder and return encoded labels.

class secretflow.preprocessing.encoder.OneHotEncoder(min_frequency=None, max_categories=None)[source]#

Bases: object

Encode categorical features as a one-hot numeric array.

The same as sklearn.preprocessing.OneHotEncoder except that the input/output is a federated dataframe.

Note: min_frequency and max_categories are calculated per partition, so they are currently only available for vertical scenarios.

Parameters
  • min_frequency

    int or float, default=None. Specifies the minimum frequency below which a category will be considered infrequent.

    • If int, categories with a smaller cardinality will be considered infrequent.

    • If float, categories with a smaller cardinality than min_frequency * n_samples will be considered infrequent.

  • max_categories – int, default=None Specifies an upper limit to the number of output features for each input feature when considering infrequent categories. If there are infrequent categories, max_categories includes the category representing the infrequent categories along with the frequent categories. If None, there is no limit to the number of output features.

_encoder#

the sklearn OneHotEncoder instance.

Examples

>>> from secretflow.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> enc.fit(df)
>>> enc.transform(df)

Methods:

__init__([min_frequency, max_categories])

fit(df)

Fit this encoder with X.

transform(df)

Transform X using one-hot encoding.

fit_transform(df)

Fit this OneHotEncoder with X, then transform X.

__init__(min_frequency=None, max_categories=None)[source]#
fit(df: Union[HDataFrame, VDataFrame, MixDataFrame]) Union[HDataFrame, VDataFrame, MixDataFrame][source]#

Fit this encoder with X.

transform(df: Union[HDataFrame, VDataFrame, MixDataFrame]) Union[HDataFrame, VDataFrame, MixDataFrame][source]#

Transform X using one-hot encoding.

fit_transform(df: Union[HDataFrame, VDataFrame, MixDataFrame]) Union[HDataFrame, VDataFrame, MixDataFrame][source]#

Fit this OneHotEncoder with X, then transform X.

secretflow.preprocessing.scaler module#

Classes:

MinMaxScaler()

Transform features by scaling each feature to a given range.

StandardScaler([with_mean, with_std])

Standardize features by removing the mean and scaling to unit variance.

class secretflow.preprocessing.scaler.MinMaxScaler[source]#

Bases: object

Transform features by scaling each feature to a given range.

_scaler#

the sklearn MinMaxScaler instance.

Examples

>>> from secretflow.preprocessing import MinMaxScaler
>>> scaler = MinMaxScaler()
>>> scaler.fit(df)
>>> scaler.transform(df)

Methods:

fit(df)

Compute the minimum and maximum for later scaling.

transform(df)

Scale features of X according to feature_range.

fit_transform(df)

Fit to X, then transform X.

fit(df: Union[HDataFrame, VDataFrame, MixDataFrame])[source]#

Compute the minimum and maximum for later scaling.

transform(df: Union[HDataFrame, VDataFrame, MixDataFrame]) Union[HDataFrame, VDataFrame, MixDataFrame][source]#

Scale features of X according to feature_range.

fit_transform(df: Union[HDataFrame, VDataFrame])[source]#

Fit to X, then transform X.

class secretflow.preprocessing.scaler.StandardScaler(with_mean=True, with_std=True)[source]#

Bases: object

Standardize features by removing the mean and scaling to unit variance.

StandardScaler is similar to sklearn.preprocessing.StandardScaler. The main differences are that it (a) takes HDataFrame/VDataFrame/MixDataFrame as input/output and (b) does not support sparse matrices.

The standard score of a sample x is calculated as:

z = (x - u) / s

where u is the mean of the training samples or zero if with_mean=False, and s is the standard deviation of the training samples or one if with_std=False.
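
As a quick numeric check of the formula, in plain NumPy and independent of the federated machinery:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
u, s = x.mean(), x.std()  # u = 2.5, s = sqrt(1.25)
z = (x - u) / s           # standard scores
# After scaling, z has zero mean and unit standard deviation.
```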

_scaler#

the sklearn StandardScaler instance.

_with_mean#

bool, default=True. If True, center the data before scaling.

_with_std#

bool, default=True. If True, scale the data to unit variance (or equivalently, unit standard deviation).

Examples

>>> from secretflow.preprocessing import StandardScaler
>>> data = HDataFrame(...) # your HDataFrame/VDataFrame/MixDataFrame instance.
>>> scaler = StandardScaler()
>>> scaler.fit(data)
>>> print(scaler._scaler.mean_, scaler._scaler.var_)
>>> scaler.transform(data)

Methods:

__init__([with_mean, with_std])

param with_mean

optional; same as sklearn StandardScaler.

fit(df[, aggregator])

Fit a federated dataframe.

transform(df)

Transform a federated dataframe.

fit_transform(df[, aggregator])

A convenient combination of fit and transform.

__init__(with_mean=True, with_std=True) None[source]#
Parameters
  • with_mean – optional; same as sklearn StandardScaler.

  • with_std – optional; same as sklearn StandardScaler.

fit(df: Union[HDataFrame, VDataFrame, MixDataFrame], aggregator: Optional[Aggregator] = None)[source]#

Fit a federated dataframe.

Parameters
  • df – the X to fit.

  • aggregator – optional; the aggregator used to compute the global mean and standard deviation. Shall be provided if df is a horizontally partitioned MixDataFrame.

transform(df: Union[HDataFrame, VDataFrame, MixDataFrame]) Union[HDataFrame, VDataFrame, MixDataFrame][source]#

Transform a federated dataframe.

Parameters

df – the X to transform.

Returns

a federated dataframe corresponding to the input X.

fit_transform(df: Union[HDataFrame, VDataFrame], aggregator: Optional[Aggregator] = None)[source]#

A convenient combination of fit and transform.

secretflow.preprocessing.transformer module#

Classes:

LogroundTransformer([decimals, bias])

Constructs a transformer that calculates round(log2(x + bias)) on (each partition of) a dataframe.

class secretflow.preprocessing.transformer.LogroundTransformer(decimals: int = 6, bias: float = 0.5)[source]#

Bases: _FunctionTransformer

Constructs a transformer that calculates round(log2(x + bias)) on (each partition of) a dataframe.

Parameters
  • decimals – Number of decimal places to round each column to. Defaults to 6.

  • bias – Add bias to value before log2. Defaults to 0.5.
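
The computation itself is simple. Below is a plain-pandas sketch of the same transform; the `loground` helper is illustrative only and not part of the secretflow API:

```python
import numpy as np
import pandas as pd

def loground(df: pd.DataFrame, decimals: int = 6, bias: float = 0.5) -> pd.DataFrame:
    # Illustrative stand-in for LogroundTransformer on a plain dataframe:
    # add the bias, take log2, then round to `decimals` places.
    return np.log2(df + bias).round(decimals)

df = pd.DataFrame({"x": [0.5, 1.5, 3.5]})
out = loground(df)  # log2(1) = 0, log2(2) = 1, log2(4) = 2
```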

Methods:

__init__([decimals, bias])

__init__(decimals: int = 6, bias: float = 0.5)[source]#

Module contents#

Classes:

KBinsDiscretizer([n_bins, strategy])

Bin continuous data into intervals.

LabelEncoder()

Encode target labels with value between 0 and n_classes-1.

OneHotEncoder([min_frequency, max_categories])

Encode categorical features as a one-hot numeric array.

MinMaxScaler()

Transform features by scaling each feature to a given range.

StandardScaler([with_mean, with_std])

Standardize features by removing the mean and scaling to unit variance.

LogroundTransformer([decimals, bias])

Constructs a transformer that calculates round(log2(x + bias)) on (each partition of) a dataframe.

class secretflow.preprocessing.KBinsDiscretizer(n_bins=5, strategy: str = 'quantile')[source]#

Bases: object

Bin continuous data into intervals.

This KBinsDiscretizer is almost the same as sklearn.preprocessing.KBinsDiscretizer except that the input and output are federated dataframes.

_discretizer#

the sklearn.preprocessing.KBinsDiscretizer instance used.

_n_bins#

The number of bins to produce.

_strategy#

{‘uniform’, ‘quantile’}; note that ‘kmeans’ is not supported yet.

Methods:

__init__([n_bins, strategy])

fit(df[, aggregator, comparator, ...])

Fit the estimator.

transform(df)

Discretize the data.

fit_transform(df[, aggregator, comparator, ...])

Fit the estimator with X and then transform.

__init__(n_bins=5, strategy: str = 'quantile') None[source]#
fit(df: Union[HDataFrame, VDataFrame, MixDataFrame], aggregator: Optional[Aggregator] = None, comparator: Optional[Comparator] = None, compress_thres: int = 10000, error: float = 10000.0, max_iter: int = 200) KBinsDiscretizer[source]#

Fit the estimator.

Parameters
  • df – the X to fit.

  • aggregator – optional; shall be provided if df is a horizontally partitioned MixDataFrame.

  • comparator – optional; shall be provided if df is a horizontally partitioned MixDataFrame.

  • compress_thres – optional; the compress threshold of HomoBinning.

  • error – optional; the error of HomoBinning.

  • max_iter – optional; the max iterations of HomoBinning.

Returns

the instance itself.

transform(df: Union[HDataFrame, VDataFrame, MixDataFrame]) Union[HDataFrame, VDataFrame, MixDataFrame][source]#

Discretize the data.

Parameters

df – the X to discretize.

Returns

the transformed X as a federated dataframe.

fit_transform(df: Union[HDataFrame, VDataFrame, MixDataFrame], aggregator: Optional[Aggregator] = None, comparator: Optional[Comparator] = None, compress_thres: int = 10000, error: float = 10000.0, max_iter: int = 200)[source]#

Fit the estimator with X, then transform X. A convenient combination of the fit and transform methods.

class secretflow.preprocessing.LabelEncoder[source]#

Bases: object

Encode target labels with value between 0 and n_classes-1.

The same as sklearn.preprocessing.LabelEncoder except that the input/output is a federated dataframe.

_encoder#

the sklearn LabelEncoder instance.

Examples

>>> from secretflow.preprocessing import LabelEncoder
>>> le = LabelEncoder()
>>> le.fit(df)
>>> le.transform(df)

Methods:

fit(df)

Fit label encoder.

transform(df)

Transform labels to normalized encoding.

fit_transform(df)

Fit label encoder and return encoded labels.

fit(df: Union[HDataFrame, VDataFrame, MixDataFrame])[source]#

Fit label encoder.

transform(df: Union[HDataFrame, VDataFrame, MixDataFrame]) Union[HDataFrame, VDataFrame, MixDataFrame][source]#

Transform labels to normalized encoding.

fit_transform(df: Union[HDataFrame, VDataFrame]) Union[HDataFrame, VDataFrame, MixDataFrame][source]#

Fit label encoder and return encoded labels.

class secretflow.preprocessing.OneHotEncoder(min_frequency=None, max_categories=None)[source]#

Bases: object

Encode categorical features as a one-hot numeric array.

The same as sklearn.preprocessing.OneHotEncoder except that the input/output is a federated dataframe.

Note: min_frequency and max_categories are calculated per partition, so they are currently only available for vertical scenarios.

Parameters
  • min_frequency

    int or float, default=None. Specifies the minimum frequency below which a category will be considered infrequent.

    • If int, categories with a smaller cardinality will be considered infrequent.

    • If float, categories with a smaller cardinality than min_frequency * n_samples will be considered infrequent.

  • max_categories – int, default=None Specifies an upper limit to the number of output features for each input feature when considering infrequent categories. If there are infrequent categories, max_categories includes the category representing the infrequent categories along with the frequent categories. If None, there is no limit to the number of output features.

_encoder#

the sklearn OneHotEncoder instance.

Examples

>>> from secretflow.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> enc.fit(df)
>>> enc.transform(df)

Methods:

__init__([min_frequency, max_categories])

fit(df)

Fit this encoder with X.

transform(df)

Transform X using one-hot encoding.

fit_transform(df)

Fit this OneHotEncoder with X, then transform X.

__init__(min_frequency=None, max_categories=None)[source]#
fit(df: Union[HDataFrame, VDataFrame, MixDataFrame]) Union[HDataFrame, VDataFrame, MixDataFrame][source]#

Fit this encoder with X.

transform(df: Union[HDataFrame, VDataFrame, MixDataFrame]) Union[HDataFrame, VDataFrame, MixDataFrame][source]#

Transform X using one-hot encoding.

fit_transform(df: Union[HDataFrame, VDataFrame, MixDataFrame]) Union[HDataFrame, VDataFrame, MixDataFrame][source]#

Fit this OneHotEncoder with X, then transform X.

class secretflow.preprocessing.MinMaxScaler[source]#

Bases: object

Transform features by scaling each feature to a given range.

_scaler#

the sklearn MinMaxScaler instance.

Examples

>>> from secretflow.preprocessing import MinMaxScaler
>>> scaler = MinMaxScaler()
>>> scaler.fit(df)
>>> scaler.transform(df)

Methods:

fit(df)

Compute the minimum and maximum for later scaling.

transform(df)

Scale features of X according to feature_range.

fit_transform(df)

Fit to X, then transform X.

fit(df: Union[HDataFrame, VDataFrame, MixDataFrame])[source]#

Compute the minimum and maximum for later scaling.

transform(df: Union[HDataFrame, VDataFrame, MixDataFrame]) Union[HDataFrame, VDataFrame, MixDataFrame][source]#

Scale features of X according to feature_range.

fit_transform(df: Union[HDataFrame, VDataFrame])[source]#

Fit to X, then transform X.

class secretflow.preprocessing.StandardScaler(with_mean=True, with_std=True)[source]#

Bases: object

Standardize features by removing the mean and scaling to unit variance.

StandardScaler is similar to sklearn.preprocessing.StandardScaler. The main differences are that it (a) takes HDataFrame/VDataFrame/MixDataFrame as input/output and (b) does not support sparse matrices.

The standard score of a sample x is calculated as:

z = (x - u) / s

where u is the mean of the training samples or zero if with_mean=False, and s is the standard deviation of the training samples or one if with_std=False.

_scaler#

the sklearn StandardScaler instance.

_with_mean#

bool, default=True. If True, center the data before scaling.

_with_std#

bool, default=True. If True, scale the data to unit variance (or equivalently, unit standard deviation).

Examples

>>> from secretflow.preprocessing import StandardScaler
>>> data = HDataFrame(...) # your HDataFrame/VDataFrame/MixDataFrame instance.
>>> scaler = StandardScaler()
>>> scaler.fit(data)
>>> print(scaler._scaler.mean_, scaler._scaler.var_)
>>> scaler.transform(data)

Methods:

__init__([with_mean, with_std])

param with_mean

optional; same as sklearn StandardScaler.

fit(df[, aggregator])

Fit a federated dataframe.

transform(df)

Transform a federated dataframe.

fit_transform(df[, aggregator])

A convenient combination of fit and transform.

__init__(with_mean=True, with_std=True) None[source]#
Parameters
  • with_mean – optional; same as sklearn StandardScaler.

  • with_std – optional; same as sklearn StandardScaler.

fit(df: Union[HDataFrame, VDataFrame, MixDataFrame], aggregator: Optional[Aggregator] = None)[source]#

Fit a federated dataframe.

Parameters
  • df – the X to fit.

  • aggregator – optional; the aggregator used to compute the global mean and standard deviation. Shall be provided if df is a horizontally partitioned MixDataFrame.

transform(df: Union[HDataFrame, VDataFrame, MixDataFrame]) Union[HDataFrame, VDataFrame, MixDataFrame][source]#

Transform a federated dataframe.

Parameters

df – the X to transform.

Returns

a federated dataframe corresponding to the input X.

fit_transform(df: Union[HDataFrame, VDataFrame], aggregator: Optional[Aggregator] = None)[source]#

A convenient combination of fit and transform.

class secretflow.preprocessing.LogroundTransformer(decimals: int = 6, bias: float = 0.5)[source]#

Bases: _FunctionTransformer

Constructs a transformer that calculates round(log2(x + bias)) on (each partition of) a dataframe.

Parameters
  • decimals – Number of decimal places to round each column to. Defaults to 6.

  • bias – Add bias to value before log2. Defaults to 0.5.

Methods:

__init__([decimals, bias])

__init__(decimals: int = 6, bias: float = 0.5)[source]#