secretflow.data package#

Submodules#

secretflow.data.base module#

Classes:

`DataFrameBase`()	Abstract base class for horizontal, vertical and mixed partitioned DataFrame
`Partition`([data])	Slice of data that makes up horizontal, vertical and mixed partitioned DataFrame.

class secretflow.data.base.DataFrameBase[source]#

Bases: ABC

Abstract base class for horizontal, vertical and mixed partitioned DataFrame

Methods:

`min`()	Gets minimum value of all columns
`max`()	Gets maximum value of all columns
`count`()	Gets number of rows
`values`()	Get underlying ndarray

abstract min()[source]#: Gets minimum value of all columns

abstract max()[source]#: Gets maximum value of all columns

abstract count()[source]#: Gets number of rows

abstract values()[source]#: Gets the underlying ndarray
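To make the abstract contract concrete, here is a minimal sketch in plain Python. `FrameBase` and `ColumnFrame` are hypothetical names introduced only for illustration; they mirror the four abstract methods listed above and are not SecretFlow classes.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class FrameBase(ABC):
    """Hypothetical stand-in exposing the same four abstract methods."""

    @abstractmethod
    def min(self): ...

    @abstractmethod
    def max(self): ...

    @abstractmethod
    def count(self): ...

    @abstractmethod
    def values(self): ...


class ColumnFrame(FrameBase):
    """Toy implementation backed by a dict of column name -> list of numbers."""

    def __init__(self, columns: Dict[str, List[float]]):
        self._columns = columns

    def min(self) -> Dict[str, float]:
        # Minimum of every column.
        return {name: min(col) for name, col in self._columns.items()}

    def max(self) -> Dict[str, float]:
        # Maximum of every column.
        return {name: max(col) for name, col in self._columns.items()}

    def count(self) -> int:
        # Number of rows; all columns are assumed to have equal length.
        return len(next(iter(self._columns.values()), []))

    def values(self) -> List[List[float]]:
        # Row-major copy of the data.
        return [list(row) for row in zip(*self._columns.values())]


frame = ColumnFrame({"a": [3, 1, 2], "b": [10, 30, 20]})
```

Instantiating `FrameBase` directly raises `TypeError`, which is the point of marking the methods abstract: every concrete frame type has to supply all four.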

class secretflow.data.base.Partition(data: Optional[PYUObject] = None)[source]#

Bases: DataFrameBase

Slice of data that makes up horizontal, vertical and mixed partitioned DataFrame.

data#

Reference to the pandas.DataFrame located on the local node.

Type: PYUObject

Attributes:

`data`
`values`	Returns the underlying ndarray.
`index`	Returns the index (row labels) of the DataFrame.
`dtypes`	Returns the dtypes in the DataFrame.
`columns`	Returns the column labels of the DataFrame.
`shape`	Returns a tuple representing the dimensionality of the DataFrame.

Methods:

`mean`(*args, **kwargs)	Returns the mean of the values over the requested axis.
`var`(*args, **kwargs)	Returns the variance of the values over the requested axis.
`std`(*args, **kwargs)	Returns the standard deviation of the values over the requested axis.
`sem`(*args, **kwargs)	Returns the standard error of the mean over the requested axis.
`skew`(*args, **kwargs)	Returns the skewness over the requested axis.
`kurtosis`(*args, **kwargs)	Returns the kurtosis over the requested axis.
`sum`(*args, **kwargs)	Returns the sum of the values over the requested axis.
`replace`(*args, **kwargs)	Replace values given in to_replace with value.
`quantile`([q, axis])	Returns values at the given quantile over requested axis.
`min`(*args, **kwargs)	Returns the minimum of the values over the requested axis.
`mode`(*args, **kwargs)	Returns the mode of the values over the requested axis.
`max`(*args, **kwargs)	Returns the maximum of the values over the requested axis.
`count`(*args, **kwargs)	Counts non-NA cells for each column or row.
`isna`()	Detects missing values for an array-like object. Same as pandas.DataFrame.isna. Returns a mask of bool values for each element in the DataFrame that indicates whether the element is an NA value.
`pow`(*args, **kwargs)	Gets the exponential power of (a partition of) the dataframe and other, element-wise (binary operator pow).
`subtract`(*args, **kwargs)	Gets the subtraction of (a partition of) the dataframe and other, element-wise (binary operator sub).
`round`(*args, **kwargs)	Round the (partition of) DataFrame to a variable number of decimal places.
`select_dtypes`(*args, **kwargs)	Returns a subset of the DataFrame's columns based on the column dtypes.
`astype`(dtype[, copy, errors])	Cast a pandas object to a specified dtype `dtype`.
`iloc`(index)	Integer-location based indexing for selection by position.
`drop`([labels, axis, index, columns, level, ...])	See pandas.DataFrame.drop
`fillna`([value, method, axis, inplace, ...])	See `pandas.DataFrame.fillna()`
`rename`([mapper, index, columns, axis, copy, ...])	See `pandas.DataFrame.rename()`
`value_counts`(*args, **kwargs)	Return a Series containing counts of unique values.
`to_csv`(filepath, **kwargs)	Save DataFrame to csv file.
`copy`()	Shallow copy.
`__init__`([data])

data: PYUObject = None#

mean(*args, **kwargs) → Partition[source]#

Returns the mean of the values over the requested axis.

Returns: mean values series.
Return type: Partition

var(*args, **kwargs) → Partition[source]#

Returns the variance of the values over the requested axis.

Returns: variance values series.
Return type: Partition

std(*args, **kwargs) → Partition[source]#

Returns the standard deviation of the values over the requested axis.

Returns: standard deviation values series.
Return type: Partition

sem(*args, **kwargs) → Partition[source]#

Returns the standard error of the mean over the requested axis.

Returns: standard error of the mean series.
Return type: Partition

skew(*args, **kwargs) → Partition[source]#

Returns the skewness over the requested axis.

Returns: skewness series.
Return type: Partition

kurtosis(*args, **kwargs) → Partition[source]#

Returns the kurtosis over the requested axis.

Returns: kurtosis series.
Return type: Partition

sum(*args, **kwargs) → Partition[source]#

Returns the sum of the values over the requested axis.

Returns: sum values series.
Return type: Partition

replace(*args, **kwargs) → Partition[source]#

Replace values given in to_replace with value. Same as pandas.DataFrame.replace. Values of the DataFrame are replaced with other values dynamically.

Returns: a Partition of the same shape with the matching values replaced.
Return type: Partition

quantile(q=0.5, axis=0) → Partition[source]#

Returns values at the given quantile over requested axis.

Returns: quantile values series.
Return type: Partition

min(*args, **kwargs) → Partition[source]#

Returns the minimum of the values over the requested axis.

Returns: minimum values series.
Return type: Partition

mode(*args, **kwargs) → Partition[source]#

Returns the mode of the values over the requested axis.

For data protection reasons, only one mode will be returned.

Returns: mode values series.
Return type: Partition

max(*args, **kwargs) → Partition[source]#

Returns the maximum of the values over the requested axis.

Returns: maximum values series.
Return type: Partition

count(*args, **kwargs) → Partition[source]#

Counts non-NA cells for each column or row.

Returns: count values series.
Return type: Partition

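isna() → Partition[source]#

Detects missing values for an array-like object. Same as pandas.DataFrame.isna.

Returns: mask of bool values for each element in the DataFrame that indicates whether the element is an NA value.
Return type: Partition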

pow(*args, **kwargs) → Partition[source]#

Gets the exponential power of (a partition of) the dataframe and other, element-wise (binary operator pow). Equivalent to dataframe ** other, but with support for substituting a fill_value for missing data in one of the inputs. The reverse version is rpow. It is one of the flexible wrappers (add, sub, mul, div, mod, pow) for the arithmetic operators +, -, *, /, //, %, **.

Reference: pandas.DataFrame.pow
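The fill_value behaviour mentioned above can be illustrated with plain NumPy. This is a rough sketch of the pandas semantics (a value missing on exactly one side is replaced before computing; a value missing on both sides stays missing); `pow_with_fill` is a hypothetical helper and does not go through a Partition.

```python
import numpy as np


def pow_with_fill(base, exponent, fill_value=None):
    """Element-wise base ** exponent with pandas-like fill_value handling."""
    base = np.asarray(base, dtype=float)
    exponent = np.asarray(exponent, dtype=float)
    if fill_value is not None:
        base_missing = np.isnan(base)
        exp_missing = np.isnan(exponent)
        # Substitute only where exactly one of the two inputs is missing.
        base = np.where(base_missing & ~exp_missing, fill_value, base)
        exponent = np.where(exp_missing & ~base_missing, fill_value, exponent)
    return np.power(base, exponent)


result = pow_with_fill(
    [2.0, np.nan, 3.0, np.nan],
    [3.0, 2.0, np.nan, np.nan],
    fill_value=1.0,
)
```

Here the first three entries are computed as 2 ** 3, 1 ** 2 and 3 ** 1, while the last entry remains NaN because both inputs are missing.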

subtract(*args, **kwargs) → Partition[source]#

Gets the subtraction of (a partition of) the dataframe and other, element-wise (binary operator sub). Equivalent to dataframe - other, but with support for substituting a fill_value for missing data in one of the inputs. The reverse version is rsub. It is one of the flexible wrappers (add, sub, mul, div, mod, pow) for the arithmetic operators +, -, *, /, //, %, **.

Reference: pandas.DataFrame.subtract

round(*args, **kwargs) → Partition[source]#

Round the (partition of) DataFrame to a variable number of decimal places.

Reference: pandas.DataFrame.round

select_dtypes(*args, **kwargs) → Partition[source]#

Returns a subset of the DataFrame’s columns based on the column dtypes.

Reference: pandas.DataFrame.select_dtypes

property values#: Returns the underlying ndarray.

property index#: Returns the index (row labels) of the DataFrame.

property dtypes#: Returns the dtypes in the DataFrame.

astype(dtype, copy: bool = True, errors: str = 'raise')[source]#

Cast a pandas object to a specified dtype dtype.

All arguments are the same as in pandas.DataFrame.astype().

property columns#: Returns the column labels of the DataFrame.

property shape#: Returns a tuple representing the dimensionality of the DataFrame.

iloc(index: Union[int, slice, List[int]]) → Partition[source]#

Integer-location based indexing for selection by position.

Parameters: index (Union[int, slice, List[int]]) – row indices.
Returns: Selected DataFrame.
Return type: Partition

drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise') → Optional[Partition][source]#: See pandas.DataFrame.drop()

fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None) → Optional[Partition][source]#: See pandas.DataFrame.fillna()

rename(mapper=None, index=None, columns=None, axis=None, copy=True, inplace=False, level=None, errors='ignore') → Optional[Partition][source]#: See pandas.DataFrame.rename()

value_counts(*args, **kwargs) → Partition[source]#: Return a Series containing counts of unique values.

to_csv(filepath, **kwargs)[source]#: Save DataFrame to csv file.

copy()[source]#: Shallow copy.

__init__(data: Optional[PYUObject] = None) → None#

secretflow.data.math_utils module#

Functions:

`sum_of_difference_squares`(y1, y2)
`mean_of_difference_squares`(y1, y2)
`sum_of_difference_abs`(y1, y2)
`mean_of_difference_abs`(y1, y2)
`mean_of_difference_ratio_abs`(y1, y2)
`sum_of_difference_ratio_abs`(y1, y2)

secretflow.data.math_utils.sum_of_difference_squares(y1, y2)[source]#

secretflow.data.math_utils.mean_of_difference_squares(y1, y2)[source]#

secretflow.data.math_utils.sum_of_difference_abs(y1, y2)[source]#

secretflow.data.math_utils.mean_of_difference_abs(y1, y2)[source]#

secretflow.data.math_utils.mean_of_difference_ratio_abs(y1, y2)[source]#

secretflow.data.math_utils.sum_of_difference_ratio_abs(y1, y2)[source]#
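These helpers carry no docstrings, so the sketch below is only a plausible reading of their names in plain NumPy, not the library's implementation. In particular, dividing by y1 in the two "ratio" functions is an assumption.

```python
import numpy as np


def sum_of_difference_squares(y1, y2):
    # Sum of (y1 - y2)^2 over all elements.
    return np.sum((np.asarray(y1) - np.asarray(y2)) ** 2)


def mean_of_difference_squares(y1, y2):
    # Mean of (y1 - y2)^2 over all elements.
    return np.mean((np.asarray(y1) - np.asarray(y2)) ** 2)


def sum_of_difference_abs(y1, y2):
    # Sum of |y1 - y2| over all elements.
    return np.sum(np.abs(np.asarray(y1) - np.asarray(y2)))


def mean_of_difference_abs(y1, y2):
    # Mean of |y1 - y2| over all elements.
    return np.mean(np.abs(np.asarray(y1) - np.asarray(y2)))


def sum_of_difference_ratio_abs(y1, y2):
    # Sum of |(y1 - y2) / y1|; the choice of y1 as denominator is assumed.
    y1 = np.asarray(y1, dtype=float)
    return np.sum(np.abs((y1 - np.asarray(y2)) / y1))


def mean_of_difference_ratio_abs(y1, y2):
    # Mean of |(y1 - y2) / y1|; the choice of y1 as denominator is assumed.
    y1 = np.asarray(y1, dtype=float)
    return np.mean(np.abs((y1 - np.asarray(y2)) / y1))


a = np.array([1.0, 2.0, 4.0])
b = np.array([2.0, 2.0, 2.0])
```

With these inputs the differences are -1, 0 and 2, so the squared sum is 5 and the absolute sum is 3.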

secretflow.data.ndarray module#

Classes:

`PartitionWay`(value)	The partitioning.
`FedNdarray`(partitions, partition_way)	Horizontal or vertical partitioned Ndarray.

Functions:

`subtract`(y1, y2[, spu_device])	Subtraction of two FedNdarray objects
`load`(sources[, partition_way, allow_pickle, ...])	Load FedNdarray from data source.
`train_test_split`(data, ratio[, ...])	Split data into train and test dataset.
`shuffle`(data)	Random shuffle data.
`check_same_partition_shapes`(a1, a2)
`unary_op`(handle_function, ...[, spu_device, ...])
`mean`(y[, spu_device])	Mean of all elements
`binary_op`(handle_function, ...[, spu_device])
`get_concat_axis`(y)
`rss`(y1, y2[, spu_device])	Residual Sum of Squares of all elements
`tss`(y[, spu_device])	Total Sum of Squares of all elements
`mean_squared_error`(y_true, y_pred[, spu_device])	Mean Squared Error of all elements
`root_mean_squared_error`(y_true, y_pred[, ...])	Root Mean Squared Error of all elements
`mean_abs_err`(y_true, y_pred[, spu_device])	Mean Absolute Error
`mean_abs_percent_err`(y_true, y_pred[, ...])	Mean Absolute Percentage Error
`r2_score`(y_true, y_pred[, spu_device])	R2 Score
`histogram`(y[, bins, spu_device])	Histogram of all elements; a restricted version of the counterpart in numpy
`residual_histogram`(y1, y2[, bins, spu_device])	Histogram of residuals of y1 - y2

class secretflow.data.ndarray.PartitionWay(value)[source]#

Bases: Enum

The partitioning. HORIZONTAL: horizontal partitioning. VERTICAL: vertical partitioning.

Attributes:

`HORIZONTAL`
`VERTICAL`

HORIZONTAL = 'horizontal'#

VERTICAL = 'vertical'#

class secretflow.data.ndarray.FedNdarray(partitions: Dict[PYU, PYUObject], partition_way: PartitionWay)[source]#

Bases: object

Horizontal or vertical partitioned Ndarray.

partitions#

Dict mapping each PYU to a reference to the local numpy.ndarray; together these make up the federated ndarray.

Type: Dict[PYU, PYUObject]

Attributes:

`partitions`
`partition_way`
`shape`	Get shape of united ndarray.

Methods:

`partition_shape`()	Get ndarray shapes of all partitions.
`partition_size`()	Get ndarray sizes of all partitions.
`astype`(dtype[, order, casting, subok, copy])	Cast to a specified type.
`__init__`(partitions, partition_way)

partitions: Dict[PYU, PYUObject]#

partition_way: PartitionWay#

partition_shape()[source]#: Get ndarray shapes of all partitions.

partition_size()[source]#: Get ndarray sizes of all partitions.

property shape: Tuple[int, int]#: Get shape of united ndarray.

astype(dtype, order='K', casting='unsafe', subok=True, copy=True)[source]#

Cast to a specified type.

All arguments are the same as in numpy.ndarray.astype().

__init__(partitions: Dict[PYU, PYUObject], partition_way: PartitionWay) → None#
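To illustrate how the partition shapes relate to the shape of the united ndarray, the sketch below assumes that horizontal partitions stack along rows and vertical partitions stack along columns. `Way` and `united_shape` are hypothetical names that only mirror PartitionWay and the shape property; party names are plain strings rather than PYU devices.

```python
from enum import Enum
from typing import Dict, Tuple


class Way(Enum):
    """Hypothetical mirror of PartitionWay."""
    HORIZONTAL = "horizontal"
    VERTICAL = "vertical"


def united_shape(partition_shapes: Dict[str, Tuple[int, int]], way: Way) -> Tuple[int, int]:
    """Combine per-party 2-D shapes into the shape of the united array."""
    shapes = list(partition_shapes.values())
    if not shapes:
        return (0, 0)
    rows, cols = shapes[0]
    if way is Way.HORIZONTAL:
        # Each party holds different rows of the same columns.
        if any(s[1] != cols for s in shapes):
            raise ValueError("horizontal partitions must share the column count")
        return (sum(s[0] for s in shapes), cols)
    # Each party holds different columns of the same rows.
    if any(s[0] != rows for s in shapes):
        raise ValueError("vertical partitions must share the row count")
    return (rows, sum(s[1] for s in shapes))


h_shape = united_shape({"alice": (2, 3), "bob": (4, 3)}, Way.HORIZONTAL)
v_shape = united_shape({"alice": (5, 2), "bob": (5, 3)}, Way.VERTICAL)
```

In the horizontal case the row counts add up to 6 rows of 3 columns; in the vertical case the column counts add up to 5 rows of 5 columns.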

secretflow.data.ndarray.subtract(y1: FedNdarray, y2: FedNdarray, spu_device: Optional[SPU] = None)[source]#

Subtraction of two FedNdarray objects.

Returns: result of subtraction

The result is computable as long as y1 and y2 have the same shape. They may have different partition shapes.

secretflow.data.ndarray.load(sources: Dict[PYU, Union[str, Callable[[], ndarray], PYUObject]], partition_way: PartitionWay = PartitionWay.VERTICAL, allow_pickle=False, encoding='ASCII') → FedNdarray[source]#

Load FedNdarray from data source.

Warning

Loading files that contain object arrays uses the pickle module, which is not secure against erroneous or maliciously constructed data. Consider passing allow_pickle=False to load data that is known not to contain object arrays for the safer handling of untrusted sources.

Parameters

sources – Data source in each partition. Must be one of the following: 1) a loaded numpy.ndarray; 2) a local filepath to a .npy or .npz file; 3) a callable that returns a numpy.ndarray.
partition_way – How the data is partitioned across parties. Defaults to PartitionWay.VERTICAL.
allow_pickle – Allow loading pickled object arrays stored in npy files.
encoding – What encoding to use when reading Python 2 strings.

Raises

TypeError – illegal source.

Returns

A FedNdarray if the source is a PYU object or a .npy file, or a dict {key: FedNdarray} if the source is a .npz file.

Examples

>>> # alice and bob are PYU devices; the paths are placeholders for .npy files.
>>> fed_arr = load({alice: 'example/alice.npy', bob: 'example/bob.npy'})
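The difference between .npy and .npz sources follows from how NumPy stores them: a .npy file holds one array, while a .npz archive holds several arrays keyed by name. The sketch below shows this with plain NumPy only, without any SecretFlow devices; the file names are made up for the example.

```python
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    npy_path = os.path.join(tmp, "alice.npy")
    npz_path = os.path.join(tmp, "alice.npz")

    # One array per .npy file, several named arrays per .npz archive.
    np.save(npy_path, np.arange(6).reshape(2, 3))
    np.savez(npz_path, x=np.ones((2, 2)), y=np.zeros(3))

    # allow_pickle=False refuses object arrays, which avoids unpickling
    # untrusted data.
    single = np.load(npy_path, allow_pickle=False)
    with np.load(npz_path, allow_pickle=False) as archive:
        named = {key: archive[key] for key in archive.files}
```

`single` is one ndarray, whereas `named` is a dict with one entry per array stored in the archive, which matches the two return forms described above.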

secretflow.data.ndarray.train_test_split(data: FedNdarray, ratio: float, random_state: Optional[int] = None, shuffle=True) → Tuple[FedNdarray, FedNdarray][source]#

Split data into train and test dataset.

Parameters

data – Data to split.
ratio – Fraction of the data assigned to the train dataset.
random_state – Controls the shuffling applied to the data before applying the split.
shuffle – Whether or not to shuffle the data before splitting.

Returns

Tuple of train and test datasets.
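As a rough single-machine illustration of what a ratio-based split does, the sketch below shuffles row indices with a seeded generator and cuts them at the train ratio. `split_rows` is a hypothetical helper; the rounding rule (truncation) and the random generator are assumptions and may differ from what SecretFlow does.

```python
import numpy as np


def split_rows(data, ratio, random_state=None, shuffle=True):
    """Split a 2-D array by rows into (train, test) using a train ratio."""
    n_rows = data.shape[0]
    order = np.arange(n_rows)
    if shuffle:
        # A fixed random_state makes the permutation reproducible.
        order = np.random.default_rng(random_state).permutation(n_rows)
    n_train = int(n_rows * ratio)  # assumed rounding: truncate
    return data[order[:n_train]], data[order[n_train:]]


data = np.arange(16).reshape(8, 2)
train, test = split_rows(data, ratio=0.75, random_state=0)
```

Every row ends up in exactly one of the two outputs, and calling the function again with the same random_state reproduces the same split.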

secretflow.data.ndarray.shuffle(data: FedNdarray)[source]#

Random shuffle data.

Parameters: data – data to be shuffled.

secretflow.data.ndarray.check_same_partition_shapes(a1: FedNdarray, a2: FedNdarray)[source]#

secretflow.data.ndarray.unary_op(handle_function: Callable, len_1_handle_function: Callable, y: FedNdarray, spu_device: Optional[SPU] = None, simulate_double_value_replacer_handle: Optional[Callable] = None)[source]#

secretflow.data.ndarray.mean(y: FedNdarray, spu_device: Optional[SPU] = None)[source]#

Mean of all elements

Parameters

y – FedNdarray
spu_device – SPU

If y is from a single party, then a PYUObject is returned. If y is from multiple parties, then an SPU device is required and an SPUObject is returned.

If y is empty, 0 is returned.

secretflow.data.ndarray.binary_op(handle_function: Callable, len_1_handle_function: Callable, y1: FedNdarray, y2: FedNdarray, spu_device: Optional[SPU] = None)[source]#

secretflow.data.ndarray.get_concat_axis(y: FedNdarray) → int[source]#

secretflow.data.ndarray.rss(y1: FedNdarray, y2: FedNdarray, spu_device: Optional[SPU] = None)[source]#

Residual Sum of Squares of all elements

For more detail on rss, see https://en.wikipedia.org/wiki/Residual_sum_of_squares

Parameters

y1 – FedNdarray
y2 – FedNdarray
spu_device – SPU

y1 and y2 must have the same device and partition shapes.

If y1 is from a single party, then a PYUObject is returned. If y1 is from multiple parties, then an SPU device is required and an SPUObject is returned.

If y1 is empty, 0 is returned.

secretflow.data.ndarray.tss(y: FedNdarray, spu_device: Optional[SPU] = None)[source]#

Total Sum of Squares of all elements (proportional to the variance)

For more detail on tss, see https://en.wikipedia.org/wiki/Total_sum_of_squares

Parameters

y – FedNdarray
spu_device – SPU

If y is from a single party, then a PYUObject is returned. If y is from multiple parties, then an SPU device is required and an SPUObject is returned.

If y is empty, 0 is returned.
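For reference, the two quantities can be written in plain NumPy as below. This is a local sketch of the textbook definitions, with the documented "empty input returns 0" rule added; `rss_plain` and `tss_plain` are hypothetical names and nothing here runs on PYU or SPU devices.

```python
import numpy as np


def rss_plain(y1, y2):
    """Residual sum of squares: sum of (y1 - y2)^2 over all elements."""
    y1, y2 = np.asarray(y1, dtype=float), np.asarray(y2, dtype=float)
    if y1.size == 0:
        return 0.0
    return float(np.sum((y1 - y2) ** 2))


def tss_plain(y):
    """Total sum of squares: sum of (y - mean(y))^2 over all elements."""
    y = np.asarray(y, dtype=float)
    if y.size == 0:
        return 0.0
    return float(np.sum((y - y.mean()) ** 2))


rss_value = rss_plain([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
tss_value = tss_plain([1.0, 2.0, 3.0])
```

Dividing `tss_plain(y)` by the number of elements gives the population variance, which is the sense in which tss relates to variance.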

secretflow.data.ndarray.mean_squared_error(y_true: FedNdarray, y_pred: FedNdarray, spu_device: Optional[SPU] = None)[source]#

Mean Squared Error of all elements

For more detail on mse, see https://en.wikipedia.org/wiki/Mean_squared_error

Parameters

y_true – FedNdarray
y_pred – FedNdarray
spu_device – SPU

y_true and y_pred must have the same device and partition shapes.

If y_true is from a single party, then a PYUObject is returned. If y_true is from multiple parties, then an SPU device is required and an SPUObject is returned.

If y_true is empty, 0 is returned.

secretflow.data.ndarray.root_mean_squared_error(y_true: FedNdarray, y_pred: FedNdarray, spu_device: Optional[SPU] = None)[source]#

Root Mean Squared Error of all elements

For more detail on rmse, see https://en.wikipedia.org/wiki/Root-mean-square_deviation

Parameters

y_true – FedNdarray
y_pred – FedNdarray
spu_device – SPU

y_true and y_pred must have the same device and partition shapes.

If y_true is from a single party, then a PYUObject is returned. If y_true is from multiple parties, then an SPU device is required and an SPUObject is returned.

If y_true is empty, 0 is returned.
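The relationship between the two error measures is easiest to see locally: RMSE is the square root of MSE. The sketch below uses plain NumPy and hypothetical names (`mse_plain`, `rmse_plain`); it follows the textbook definitions and is not the federated implementation.

```python
import numpy as np


def mse_plain(y_true, y_pred):
    """Mean of squared differences over all elements (0 for empty input)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if y_true.size == 0:
        return 0.0
    return float(np.mean((y_true - y_pred) ** 2))


def rmse_plain(y_true, y_pred):
    """Square root of the mean squared error."""
    return float(np.sqrt(mse_plain(y_true, y_pred)))


mse_value = mse_plain([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 8.0])
rmse_value = rmse_plain([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 8.0])
```

Only the last element differs, by 4, so the squared error of 16 averaged over four elements gives an MSE of 4 and an RMSE of 2.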

secretflow.data.ndarray.mean_abs_err(y_true: FedNdarray, y_pred: FedNdarray, spu_device: Optional[SPU] = None)[source]#

Mean Absolute Error

For more detail on mean absolute error, see https://en.wikipedia.org/wiki/Mean_absolute_error

Parameters

y_true – FedNdarray
y_pred – FedNdarray
spu_device – SPU

y_true and y_pred must have the same device and partition shapes.

If y_true is from a single party, then a PYUObject is returned. If y_true is from multiple parties, then an SPU device is required and an SPUObject is returned.

If y_true is empty, 0 is returned.

secretflow.data.ndarray.mean_abs_percent_err(y_true: FedNdarray, y_pred: FedNdarray, spu_device: Optional[SPU] = None)[source]#

Mean Absolute Percentage Error

For more detail on mean absolute percentage error, see https://en.wikipedia.org/wiki/Mean_absolute_percentage_error

Parameters

y_true – FedNdarray
y_pred – FedNdarray
spu_device – SPU

y_true and y_pred must have the same device and partition shapes.

If y_true is from a single party, then a PYUObject is returned. If y_true is from multiple parties, then an SPU device is required and an SPUObject is returned.

If y_true is empty, 0 is returned.
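A local NumPy sketch of the two absolute-error measures follows. `mae_plain` and `mape_plain` are hypothetical names; expressing the percentage error as a fraction (not multiplied by 100) and dividing by y_true are assumptions based on the usual textbook definition.

```python
import numpy as np


def mae_plain(y_true, y_pred):
    """Mean of |y_true - y_pred| over all elements (0 for empty input)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if y_true.size == 0:
        return 0.0
    return float(np.mean(np.abs(y_true - y_pred)))


def mape_plain(y_true, y_pred):
    """Mean of |(y_true - y_pred) / y_true|, returned as a fraction."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if y_true.size == 0:
        return 0.0
    return float(np.mean(np.abs((y_true - y_pred) / y_true)))


mae_value = mae_plain([2.0, 4.0], [1.0, 5.0])
mape_value = mape_plain([2.0, 4.0], [1.0, 5.0])
```

Both predictions are off by 1, so the mean absolute error is 1, while the relative errors of 0.5 and 0.25 average to 0.375. Note that the percentage error is undefined wherever y_true is zero.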

secretflow.data.ndarray.r2_score(y_true: FedNdarray, y_pred: FedNdarray, spu_device: Optional[SPU] = None)[source]#

R2 Score

For more detail on the r2 score, see https://en.wikipedia.org/wiki/Coefficient_of_determination

Parameters

y_true – FedNdarray
y_pred – FedNdarray
spu_device – SPU

y_true and y_pred must have the same device and partition shapes.

If y_true is from a single party, then a PYUObject is returned. If y_true is from multiple parties, then an SPU device is required and an SPUObject is returned.

If y_true is empty, 0 is returned.
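The coefficient of determination is conventionally defined as 1 - RSS / TSS. The sketch below applies that textbook formula in plain NumPy; `r2_plain` is a hypothetical name, and the case of a constant y_true (TSS of zero) is deliberately not handled.

```python
import numpy as np


def r2_plain(y_true, y_pred):
    """R2 = 1 - RSS / TSS, assuming y_true is not constant."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rss = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    tss = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return float(1.0 - rss / tss)


perfect = r2_plain([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
baseline = r2_plain([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
```

A perfect prediction scores 1, and always predicting the mean of y_true scores 0, which is why 0 is commonly read as "no better than the mean".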

secretflow.data.ndarray.histogram(y: FedNdarray, bins: int = 10, spu_device: Optional[SPU] = None)[source]#

Histogram of all elements. This is a restricted version of the counterpart in numpy.

For more detail on histogram, see https://numpy.org/doc/stable/reference/generated/numpy.histogram.html

Parameters

y – FedNdarray
bins – int, 10 by default
spu_device – SPU

If y is from a single party, then a PYUObject is returned. If y is from multiple parties, then an SPU device is required and an SPUObject is returned.
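For comparison, this is what the NumPy counterpart returns for an integer bins argument on a local array. It is shown only to make the result structure concrete; whether the federated version returns exactly the same pair is not stated in this page.

```python
import numpy as np

y = np.arange(100, dtype=float).reshape(20, 5)
bins = 10

# numpy.histogram flattens its input and returns (counts, bin_edges);
# with an integer bins argument the edges are equally spaced over
# [y.min(), y.max()].
counts, edges = np.histogram(y, bins=bins)
```

`counts` has one entry per bin and sums to the number of elements, while `edges` has one more entry than there are bins.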

secretflow.data.ndarray.residual_histogram(y1: FedNdarray, y2: FedNdarray, bins: int = 10, spu_device: Optional[SPU] = None)[source]#

Histogram of residuals of y1 - y2

Supports a function equivalent to histogram(y1 - y2) even if y1 and y2 have distinct partition shapes.

Parameters

y1 – FedNdarray
y2 – FedNdarray
bins – int, 10 by default
spu_device – SPU

If y1 is from a single party, then a PYUObject is returned. If y1 is from multiple parties, then an SPU device is required and an SPUObject is returned.

secretflow.data.split module#

Functions:

train_test_split(data[, test_size, ...])

Split data into train and test dataset.

secretflow.data.split.train_test_split(data: Union[VDataFrame, HDataFrame, FedNdarray], test_size=None, train_size=None, random_state=1234, shuffle=True, stratify=None) → Tuple[object, object][source]#

Split data into train and test dataset.

Parameters

data – Data to split. Supported types are VDataFrame, HDataFrame and FedNdarray.
test_size (float) – Test dataset size, default is None.
train_size (float) – Train dataset size, default is None.
random_state (int) – Controls the shuffling applied to the data before applying the split.
shuffle (bool) – Whether or not to shuffle the data before splitting, default is True.
stratify (array-like) – If not None, data is split in a stratified fashion, using this as the class labels.

Returns: Tuple of train and test datasets.

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from secretflow.data.base import Partition
>>> from secretflow.data.ndarray import load
>>> from secretflow.data.split import train_test_split
>>> from secretflow.data.vertical import VDataFrame
>>> # alice and bob are PYU devices created beforehand.
>>> # FedNdarray
>>> alice_arr = alice(lambda: np.array([[1, 2, 3], [4, 5, 6]]))()
>>> bob_arr = bob(lambda: np.array([[11, 12, 13], [14, 15, 16]]))()
>>> fed_arr = load({alice: alice_arr, bob: bob_arr})
>>> X_train, X_test = train_test_split(
...     fed_arr, test_size=0.33, random_state=42)
>>> # VDataFrame
>>> df_alice = pd.DataFrame({'a1': ['K5', 'K1', None, 'K6'],
...                          'a2': ['A5', 'A1', 'A2', 'A6'],
...                          'a3': [5, 1, 2, 6]})
>>> df_bob = pd.DataFrame({'b4': [10.2, 20.5, None, -0.4],
...                        'b5': ['B3', None, 'B9', 'B4'],
...                        'b6': [3, 1, 9, 4]})
>>> vdf = VDataFrame(
...     {alice: Partition(data=alice(lambda: df_alice)()),
...      bob: Partition(data=bob(lambda: df_bob)())})
>>> train_vdf, test_vdf = train_test_split(vdf, test_size=0.33, random_state=42)
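With vertically partitioned data each party holds different columns of the same rows, so a split only makes sense if every party selects the same row indices. The sketch below shows that idea locally with NumPy arrays standing in for the parties' partitions. `split_vertical` is a hypothetical helper, and the rounding of test_size and the random generator are assumptions rather than SecretFlow's behaviour.

```python
import numpy as np


def split_vertical(partitions, test_size, random_state=1234):
    """Split every party's columns with one shared row permutation."""
    row_counts = {part.shape[0] for part in partitions.values()}
    if len(row_counts) != 1:
        raise ValueError("all vertical partitions must have the same row count")
    n_rows = row_counts.pop()

    # One permutation, reused for every party, keeps rows aligned.
    order = np.random.default_rng(random_state).permutation(n_rows)
    n_test = int(round(n_rows * test_size))  # assumed rounding rule
    test_idx, train_idx = order[:n_test], order[n_test:]

    train = {party: part[train_idx] for party, part in partitions.items()}
    test = {party: part[test_idx] for party, part in partitions.items()}
    return train, test


# The first column of each partition stores the row id for easy checking.
row_ids = np.arange(4)
parts = {
    "alice": np.column_stack([row_ids, row_ids * 10]),
    "bob": np.column_stack([row_ids, row_ids * 100]),
}
train_parts, test_parts = split_vertical(parts, test_size=0.25, random_state=42)
```

After the split, the row ids held by alice and bob still match position by position in both the train and the test output.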

Module contents#

Classes:

`FedNdarray`(partitions, partition_way)	Horizontal or vertical partitioned Ndarray.
`PartitionWay`(value)	The partitioning.

class secretflow.data.FedNdarray(partitions: Dict[PYU, PYUObject], partition_way: PartitionWay)[source]#

Bases: object

Horizontal or vertical partitioned Ndarray.

partitions#

Dict mapping each PYU to a reference to the local numpy.ndarray; together these make up the federated ndarray.

Type: Dict[PYU, PYUObject]

Attributes:

`partitions`
`partition_way`
`shape`	Get shape of united ndarray.

Methods:

`partition_shape`()	Get ndarray shapes of all partitions.
`partition_size`()	Get ndarray sizes of all partitions.
`astype`(dtype[, order, casting, subok, copy])	Cast to a specified type.
`__init__`(partitions, partition_way)

partitions: Dict[PYU, PYUObject]#

partition_way: PartitionWay#

partition_shape()[source]#: Get ndarray shapes of all partitions.

partition_size()[source]#: Get ndarray sizes of all partitions.

property shape: Tuple[int, int]#: Get shape of united ndarray.

astype(dtype, order='K', casting='unsafe', subok=True, copy=True)[source]#

Cast to a specified type.

All arguments are the same as in numpy.ndarray.astype().

__init__(partitions: Dict[PYU, PYUObject], partition_way: PartitionWay) → None#

class secretflow.data.PartitionWay(value)[source]#

Bases: Enum

The partitioning. HORIZONTAL: horizontal partitioning. VERTICAL: vertical partitioning.

Attributes:

`HORIZONTAL`
`VERTICAL`

HORIZONTAL = 'horizontal'#

VERTICAL = 'vertical'#
