secretflow.data.horizontal package#

Submodules#

secretflow.data.horizontal.dataframe module#

Classes:

HDataFrame(partitions, ...)

Federated dataframe holding horizontally partitioned data.

class secretflow.data.horizontal.dataframe.HDataFrame(partitions: ~typing.Dict[~secretflow.device.device.pyu.PYU, ~secretflow.data.base.Partition] = <factory>, aggregator: ~typing.Optional[~secretflow.security.aggregation.aggregator.Aggregator] = None, comparator: ~typing.Optional[~secretflow.security.compare.comparator.Comparator] = None)[source]#

Bases: DataFrameBase

Federated dataframe holding horizontally partitioned data.

This dataframe is designed to provide a federated dataframe that can be used just like a pandas one. The original data is still stored locally with each data holder and is never transmitted out of its domain while any method executes.

Some methods need to compute global statistics; e.g., the global maximum is needed when calling the max method. An aggregator or comparator is expected here for global sums and extreme values respectively.
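Conceptually, only small per-partition aggregates reach the aggregator or comparator, never the raw rows. A minimal pure-Python sketch of this idea (names are hypothetical, independent of the actual SecretFlow implementation, and shown without any secure aggregation):

```python
# Hypothetical sketch: each party computes local statistics on its own rows;
# only these small aggregates (not the rows themselves) are combined globally.

alice_rows = [5.1, 4.9, 6.3]   # stays with alice
bob_rows = [5.8, 7.9]          # stays with bob

def local_stats(rows):
    # Each party reports only (sum, count, max) -- never the raw data.
    return sum(rows), len(rows), max(rows)

stats = [local_stats(alice_rows), local_stats(bob_rows)]

# The aggregator's role: combine additive statistics (sums, counts).
global_sum = sum(s for s, _, _ in stats)
global_count = sum(c for _, c, _ in stats)
global_mean = global_sum / global_count

# The comparator's role: combine order statistics (max/min).
global_max = max(m for _, _, m in stats)

print(global_mean)  # mean over all 5 values
print(global_max)
```

In the real HDataFrame, the aggregator/comparator additionally hides individual parties' aggregates (e.g. via secure aggregation); this sketch only shows which statistics travel.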

partitions#

a dict of pyu and partition.

Type

Dict[secretflow.device.device.pyu.PYU, secretflow.data.base.Partition]

aggregator#

the aggregator for computing global values such as mean.

Type

secretflow.security.aggregation.aggregator.Aggregator

comparator#

the comparator for computing global values such as maximum/minimum.

Type

secretflow.security.compare.comparator.Comparator

Examples

>>> from secretflow.data.horizontal import read_csv
>>> from secretflow.security.aggregation import PlainAggregator, PlainComparator
>>> from secretflow import PYU
>>> alice = PYU('alice')
>>> bob = PYU('bob')
>>> h_df = read_csv({alice: 'alice.csv', bob: 'bob.csv'},
                    aggregator=PlainAggregator(alice),
                    comparator=PlainComparator(alice))
>>> h_df.columns
Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class'], dtype='object')
>>> h_df.mean(numeric_only=True)
sepal_length    5.827693
sepal_width     3.054000
petal_length    3.730000
petal_width     1.198667
dtype: float64
>>> h_df.min(numeric_only=True)
sepal_length    4.3
sepal_width     2.0
petal_length    1.0
petal_width     0.1
dtype: float64
>>> h_df.max(numeric_only=True)
sepal_length    7.9
sepal_width     4.4
petal_length    6.9
petal_width     2.5
dtype: float64
>>> h_df.count()
sepal_length    130
sepal_width     150
petal_length    120
petal_width     150
class           150
dtype: int64
>>> h_df.fillna({'sepal_length': 2})

Attributes:

partitions

aggregator

comparator

values

Return a federated NumPy representation of the DataFrame.

dtypes

Return the dtypes in the DataFrame.

columns

The column labels of the DataFrame.

shape

Return a tuple representing the dimensionality of the DataFrame.

Methods:

mean(*args, **kwargs)

Return the mean of the values over the requested axis.

min(*args, **kwargs)

Return the min of the values over the requested axis.

max(*args, **kwargs)

Return the max of the values over the requested axis.

sum(*args, **kwargs)

Return the sum of the values over the requested axis.

count(*args, **kwargs)

Count non-NA cells for each column or row.

isna()

Detect missing values for an array-like object, same as pandas.DataFrame.isna.

quantile([q, axis])

kurtosis(*args, **kwargs)

skew(*args, **kwargs)

sem(*args, **kwargs)

std(*args, **kwargs)

var(*args, **kwargs)

replace(*args, **kwargs)

mode(*args, **kwargs)

astype(dtype[, copy, errors])

Cast object to a specified dtype.

partition_shape()

Return shapes of each partition.

copy()

Shallow copy of this dataframe.

drop([labels, axis, index, columns, level, ...])

Drop specified labels from rows or columns.

fillna([value, method, axis, inplace, ...])

Fill NA/NaN values using the specified method.

to_csv(fileuris, **kwargs)

Write object to a comma-separated values (csv) file.

__init__([partitions, aggregator, comparator])

partitions: Dict[PYU, Partition]#
aggregator: Aggregator = None#
comparator: Comparator = None#
mean(*args, **kwargs) Series[source]#

Return the mean of the values over the requested axis.

All arguments are the same as pandas.DataFrame.mean().

Returns

pd.Series

min(*args, **kwargs) Series[source]#

Return the min of the values over the requested axis.

All arguments are the same as pandas.DataFrame.min().

Returns

pd.Series

max(*args, **kwargs) Series[source]#

Return the max of the values over the requested axis.

All arguments are the same as pandas.DataFrame.max().

Returns

pd.Series

sum(*args, **kwargs) Series[source]#

Return the sum of the values over the requested axis.

All arguments are the same as pandas.DataFrame.sum().

Returns

pd.Series

count(*args, **kwargs) Series[source]#

Count non-NA cells for each column or row.

All arguments are the same as pandas.DataFrame.count().

Returns

pd.Series

isna() HDataFrame[source]#

Detects missing values for an array-like object. Same as pandas.DataFrame.isna.

Returns

HDataFrame: a mask of bool values for each element in the DataFrame that indicates whether an element is an NA value.

Reference:

pd.DataFrame.isna

quantile(q=0.5, axis=0)[source]#
kurtosis(*args, **kwargs)[source]#
skew(*args, **kwargs)[source]#
sem(*args, **kwargs)[source]#
std(*args, **kwargs)[source]#
var(*args, **kwargs)[source]#
replace(*args, **kwargs)[source]#
mode(*args, **kwargs)[source]#
property values: FedNdarray#

Return a federated NumPy representation of the DataFrame.

Returns

FedNdarray.

property dtypes: Series#

Return the dtypes in the DataFrame.

Returns

the data type of each column.

Return type

pd.Series

astype(dtype, copy: bool = True, errors: str = 'raise')[source]#

Cast object to a specified dtype.

All arguments are the same as pandas.DataFrame.astype().

property columns#

The column labels of the DataFrame.

property shape#

Return a tuple representing the dimensionality of the DataFrame.

partition_shape()[source]#

Return shapes of each partition.

Returns

a dict of {pyu: shape}
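Because the data is split horizontally, all partitions share the same columns and the overall row count is the sum of the per-partition row counts. A small illustrative sketch (shapes and party names are hypothetical):

```python
# Hypothetical per-partition shapes for horizontally split data:
# rows are split across parties, columns are shared.
partition_shapes = {"alice": (130, 5), "bob": (20, 5)}

total_rows = sum(rows for rows, _ in partition_shapes.values())
n_cols = next(iter(partition_shapes.values()))[1]
global_shape = (total_rows, n_cols)
print(global_shape)  # (150, 5)
```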

copy() HDataFrame[source]#

Shallow copy of this dataframe.

Returns

HDataFrame.

drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise') Optional[HDataFrame][source]#

Drop specified labels from rows or columns.

All arguments are the same as pandas.DataFrame.drop().

Returns

HDataFrame without the removed index or column labels or None if inplace=True.

fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None) Optional[HDataFrame][source]#

Fill NA/NaN values using the specified method.

All arguments are the same as pandas.DataFrame.fillna().

Returns

HDataFrame with missing values filled or None if inplace=True.
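Like the other element-wise methods, filling happens partition by partition: each holder fills its own missing cells locally and no values cross domains. A rough pure-Python illustration of that semantics (hypothetical names, not the SecretFlow implementation):

```python
# Hypothetical illustration: fillna runs independently inside each partition.
partitions = {
    "alice": {"sepal_length": [5.1, None, 6.3]},  # alice's local rows
    "bob": {"sepal_length": [None, 7.9]},         # bob's local rows
}

def fillna_local(column, value):
    # Runs inside one party's domain; only that party sees its rows.
    return [value if v is None else v for v in column]

filled = {
    party: {col: fillna_local(vals, 2) for col, vals in data.items()}
    for party, data in partitions.items()
}
print(filled["alice"]["sepal_length"])  # [5.1, 2, 6.3]
```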

to_csv(fileuris: Dict[PYU, str], **kwargs)[source]#

Write object to a comma-separated values (csv) file.

Parameters
  • fileuris – a dict of file URIs specifying the output file for each PYU.

  • kwargs – all other arguments are the same as pandas.DataFrame.to_csv().

Returns

Returns a list of PYUObjects whose value is None. You can use secretflow.wait to wait for the saves to complete.

__init__(partitions: ~typing.Dict[~secretflow.device.device.pyu.PYU, ~secretflow.data.base.Partition] = <factory>, aggregator: ~typing.Optional[~secretflow.security.aggregation.aggregator.Aggregator] = None, comparator: ~typing.Optional[~secretflow.security.compare.comparator.Comparator] = None) None#

secretflow.data.horizontal.io module#

Functions:

read_csv(filepath[, aggregator, comparator])

Read a comma-separated values (csv) file into HDataFrame.

to_csv(df, file_uris, **kwargs)

Write object to a comma-separated values (csv) file.

secretflow.data.horizontal.io.read_csv(filepath: Dict[PYU, str], aggregator: Optional[Aggregator] = None, comparator: Optional[Comparator] = None, **kwargs) HDataFrame[source]#

Read a comma-separated values (csv) file into HDataFrame.

Parameters
  • filepath – a dict {PYU: file path}.

  • aggregator – optional; the aggregator assigned to the dataframe.

  • comparator – optional; the comparator assigned to the dataframe.

  • kwargs – all other arguments are the same as pandas.read_csv().

Returns

HDataFrame

Examples

>>> read_csv({PYU('alice'): 'alice.csv', PYU('bob'): 'bob.csv'})
secretflow.data.horizontal.io.to_csv(df: HDataFrame, file_uris: Dict[PYU, str], **kwargs)[source]#

Write object to a comma-separated values (csv) file.

Parameters
  • df – the HDataFrame to save.

  • file_uris – the file path of each PYU.

  • kwargs – all other arguments are same with pandas.DataFrame.to_csv().

secretflow.data.horizontal.sampler module#

Classes:

PoissonDataSampler(x, y, s_w, sampling_rate, ...)

Generates data with Poisson sampling.

class secretflow.data.horizontal.sampler.PoissonDataSampler(x, y, s_w, sampling_rate, **kwargs)[source]#

Bases: Sequence

Generates data with Poisson sampling.

Methods:

__init__(x, y, s_w, sampling_rate, **kwargs)

Initialization

set_random_seed(random_seed)

__init__(x, y, s_w, sampling_rate, **kwargs)[source]#

Initialization

set_random_seed(random_seed)[source]#
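In Poisson sampling, each example is included in a batch independently with probability equal to the sampling rate, so batch sizes vary from step to step. A rough pure-Python sketch of this selection rule (function name and semantics are assumptions for illustration, not the sampler's actual API):

```python
import random

def poisson_sample(n_examples, sampling_rate, rng):
    # Keep each index independently with probability `sampling_rate`;
    # the resulting batch size is random (Binomial, mean n * rate).
    return [i for i in range(n_examples) if rng.random() < sampling_rate]

rng = random.Random(0)  # seeding makes the draw reproducible
batch = poisson_sample(1000, 0.05, rng)
print(len(batch))  # varies around 1000 * 0.05 = 50
```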

Module contents#

Classes:

HDataFrame(partitions, ...)

Federated dataframe holding horizontally partitioned data.

Functions:

read_csv(filepath[, aggregator, comparator])

Read a comma-separated values (csv) file into HDataFrame.

to_csv(df, file_uris, **kwargs)

Write object to a comma-separated values (csv) file.

class secretflow.data.horizontal.HDataFrame(partitions: ~typing.Dict[~secretflow.device.device.pyu.PYU, ~secretflow.data.base.Partition] = <factory>, aggregator: ~typing.Optional[~secretflow.security.aggregation.aggregator.Aggregator] = None, comparator: ~typing.Optional[~secretflow.security.compare.comparator.Comparator] = None)[source]#

Bases: DataFrameBase

Federated dataframe holding horizontally partitioned data.

This dataframe is designed to provide a federated dataframe that can be used just like a pandas one. The original data is still stored locally with each data holder and is never transmitted out of its domain while any method executes.

Some methods need to compute global statistics; e.g., the global maximum is needed when calling the max method. An aggregator or comparator is expected here for global sums and extreme values respectively.

partitions#

a dict of pyu and partition.

Type

Dict[secretflow.device.device.pyu.PYU, secretflow.data.base.Partition]

aggregator#

the aggregator for computing global values such as mean.

Type

secretflow.security.aggregation.aggregator.Aggregator

comparator#

the comparator for computing global values such as maximum/minimum.

Type

secretflow.security.compare.comparator.Comparator

Examples

>>> from secretflow.data.horizontal import read_csv
>>> from secretflow.security.aggregation import PlainAggregator, PlainComparator
>>> from secretflow import PYU
>>> alice = PYU('alice')
>>> bob = PYU('bob')
>>> h_df = read_csv({alice: 'alice.csv', bob: 'bob.csv'},
                    aggregator=PlainAggregator(alice),
                    comparator=PlainComparator(alice))
>>> h_df.columns
Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class'], dtype='object')
>>> h_df.mean(numeric_only=True)
sepal_length    5.827693
sepal_width     3.054000
petal_length    3.730000
petal_width     1.198667
dtype: float64
>>> h_df.min(numeric_only=True)
sepal_length    4.3
sepal_width     2.0
petal_length    1.0
petal_width     0.1
dtype: float64
>>> h_df.max(numeric_only=True)
sepal_length    7.9
sepal_width     4.4
petal_length    6.9
petal_width     2.5
dtype: float64
>>> h_df.count()
sepal_length    130
sepal_width     150
petal_length    120
petal_width     150
class           150
dtype: int64
>>> h_df.fillna({'sepal_length': 2})

Attributes:

partitions

aggregator

comparator

values

Return a federated NumPy representation of the DataFrame.

dtypes

Return the dtypes in the DataFrame.

columns

The column labels of the DataFrame.

shape

Return a tuple representing the dimensionality of the DataFrame.

Methods:

mean(*args, **kwargs)

Return the mean of the values over the requested axis.

min(*args, **kwargs)

Return the min of the values over the requested axis.

max(*args, **kwargs)

Return the max of the values over the requested axis.

sum(*args, **kwargs)

Return the sum of the values over the requested axis.

count(*args, **kwargs)

Count non-NA cells for each column or row.

isna()

Detect missing values for an array-like object, same as pandas.DataFrame.isna.

quantile([q, axis])

kurtosis(*args, **kwargs)

skew(*args, **kwargs)

sem(*args, **kwargs)

std(*args, **kwargs)

var(*args, **kwargs)

replace(*args, **kwargs)

mode(*args, **kwargs)

astype(dtype[, copy, errors])

Cast object to a specified dtype.

partition_shape()

Return shapes of each partition.

copy()

Shallow copy of this dataframe.

drop([labels, axis, index, columns, level, ...])

Drop specified labels from rows or columns.

fillna([value, method, axis, inplace, ...])

Fill NA/NaN values using the specified method.

to_csv(fileuris, **kwargs)

Write object to a comma-separated values (csv) file.

__init__([partitions, aggregator, comparator])

partitions: Dict[PYU, Partition]#
aggregator: Aggregator = None#
comparator: Comparator = None#
mean(*args, **kwargs) Series[source]#

Return the mean of the values over the requested axis.

All arguments are the same as pandas.DataFrame.mean().

Returns

pd.Series

min(*args, **kwargs) Series[source]#

Return the min of the values over the requested axis.

All arguments are the same as pandas.DataFrame.min().

Returns

pd.Series

max(*args, **kwargs) Series[source]#

Return the max of the values over the requested axis.

All arguments are the same as pandas.DataFrame.max().

Returns

pd.Series

sum(*args, **kwargs) Series[source]#

Return the sum of the values over the requested axis.

All arguments are the same as pandas.DataFrame.sum().

Returns

pd.Series

count(*args, **kwargs) Series[source]#

Count non-NA cells for each column or row.

All arguments are the same as pandas.DataFrame.count().

Returns

pd.Series

isna() HDataFrame[source]#

Detects missing values for an array-like object. Same as pandas.DataFrame.isna.

Returns

HDataFrame: a mask of bool values for each element in the DataFrame that indicates whether an element is an NA value.

Reference:

pd.DataFrame.isna

quantile(q=0.5, axis=0)[source]#
kurtosis(*args, **kwargs)[source]#
skew(*args, **kwargs)[source]#
sem(*args, **kwargs)[source]#
std(*args, **kwargs)[source]#
var(*args, **kwargs)[source]#
replace(*args, **kwargs)[source]#
mode(*args, **kwargs)[source]#
property values: FedNdarray#

Return a federated NumPy representation of the DataFrame.

Returns

FedNdarray.

property dtypes: Series#

Return the dtypes in the DataFrame.

Returns

the data type of each column.

Return type

pd.Series

astype(dtype, copy: bool = True, errors: str = 'raise')[source]#

Cast object to a specified dtype.

All arguments are the same as pandas.DataFrame.astype().

property columns#

The column labels of the DataFrame.

property shape#

Return a tuple representing the dimensionality of the DataFrame.

partition_shape()[source]#

Return shapes of each partition.

Returns

a dict of {pyu: shape}

copy() HDataFrame[source]#

Shallow copy of this dataframe.

Returns

HDataFrame.

drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise') Optional[HDataFrame][source]#

Drop specified labels from rows or columns.

All arguments are the same as pandas.DataFrame.drop().

Returns

HDataFrame without the removed index or column labels or None if inplace=True.

fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None) Optional[HDataFrame][source]#

Fill NA/NaN values using the specified method.

All arguments are the same as pandas.DataFrame.fillna().

Returns

HDataFrame with missing values filled or None if inplace=True.

to_csv(fileuris: Dict[PYU, str], **kwargs)[source]#

Write object to a comma-separated values (csv) file.

Parameters
  • fileuris – a dict of file URIs specifying the output file for each PYU.

  • kwargs – all other arguments are the same as pandas.DataFrame.to_csv().

Returns

Returns a list of PYUObjects whose value is None. You can use secretflow.wait to wait for the saves to complete.

__init__(partitions: ~typing.Dict[~secretflow.device.device.pyu.PYU, ~secretflow.data.base.Partition] = <factory>, aggregator: ~typing.Optional[~secretflow.security.aggregation.aggregator.Aggregator] = None, comparator: ~typing.Optional[~secretflow.security.compare.comparator.Comparator] = None) None#
secretflow.data.horizontal.read_csv(filepath: Dict[PYU, str], aggregator: Optional[Aggregator] = None, comparator: Optional[Comparator] = None, **kwargs) HDataFrame[source]#

Read a comma-separated values (csv) file into HDataFrame.

Parameters
  • filepath – a dict {PYU: file path}.

  • aggregator – optional; the aggregator assigned to the dataframe.

  • comparator – optional; the comparator assigned to the dataframe.

  • kwargs – all other arguments are the same as pandas.read_csv().

Returns

HDataFrame

Examples

>>> read_csv({PYU('alice'): 'alice.csv', PYU('bob'): 'bob.csv'})
secretflow.data.horizontal.to_csv(df: HDataFrame, file_uris: Dict[PYU, str], **kwargs)[source]#

Write object to a comma-separated values (csv) file.

Parameters
  • df – the HDataFrame to save.

  • file_uris – the file path of each PYU.

  • kwargs – all other arguments are same with pandas.DataFrame.to_csv().