PSI On SPU#

The following codes are demos only. It’s NOT for production due to system security concerns, please DO NOT use it directly in production.

PSI(Private Set Intersection) is a cryptographic technique that allows two parties holding sets to compare encrypted versions of these sets in order to compute the intersection. In this scenario, neither party reveals anything to the counterparty except for the elements in the intersection.

In SecretFlow, SPU device supports three PSI protocol:

  • ECDH:semi-honest, based on public key encryption, suitable for small datasets.

  • KKRT:semi-host, based on cuckoo hashing and OT extension, suitable for large datasets.

  • BC22PCG: semi-host, psi from pseudorandom correlation generators.

Before we start, we need to initialize the environment. The following three nodes alice, bob, and carol are created on a single machine to simulate multiple participants.

[1]:
import secretflow as sf

# In case you have a running secretflow runtime already.
sf.shutdown()

sf.init(['alice', 'bob', 'carol'], num_cpus=8, log_to_driver=False)

Prepraring dataset#

First, we need a dataset for constructing vertical partitioned scenarios. For simplicity, we use iris dataset here. We add two columns to it for subsequent single-column and multi-column intersection demonstrations

  • uid:Sample unique ID.

  • month:Simulate a scenario where samples are generated monthly. The first 50% of the samples are generated in January, and the last 50% of the samples are generated in February.

[2]:
import numpy as np
from sklearn.datasets import load_iris

data, target = load_iris(return_X_y=True, as_frame=True)
data['uid'] = np.arange(len(data)).astype('str')
data['month'] = ['Jan'] * 75 + ['Feb'] * 75

data
[2]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) uid month
0 5.1 3.5 1.4 0.2 0 Jan
1 4.9 3.0 1.4 0.2 1 Jan
2 4.7 3.2 1.3 0.2 2 Jan
3 4.6 3.1 1.5 0.2 3 Jan
4 5.0 3.6 1.4 0.2 4 Jan
... ... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 145 Feb
146 6.3 2.5 5.0 1.9 146 Feb
147 6.5 3.0 5.2 2.0 147 Feb
148 6.2 3.4 5.4 2.3 148 Feb
149 5.9 3.0 5.1 1.8 149 Feb

150 rows × 6 columns

In the actual scenario, the sample data is provided by each participant, and the fields for intersection need to be agreed in advance:

  • The intersection field can be single or multiple.

  • The intersection field must be unique. If there is a duplicate, it needs to be deduplicated in advance.

For example, The following is the data provided by alice for PSI intersection, the intersection field is uid and month,we can see that [1, ‘Jan’] is duplicated.

alice.csv
---------
uid   month   c0
1     Jan     5.8
2     Jan     5.4
1     Jan     5.8
1     Feb     7.4

The data after deduplication is

alice.csv
---------
uid   month   c0
1     Jan     5.8
2     Jan     5.4
1     Feb     7.4

We randomly sample the iris data three times to simulate the data provided by alice, bob, and carol, and the three data are in an unaligned state.

[3]:
import os

os.makedirs('.data', exist_ok=True)
da, db, dc = data.sample(frac=0.9), data.sample(frac=0.8), data.sample(frac=0.7)

da.to_csv('.data/alice.csv', index=False)
db.to_csv('.data/bob.csv', index=False)
dc.to_csv('.data/carol.csv', index=False)

Two parties PSI#

We virtualize three logical devices on the physical device:

  • alice, bob: PYU device, responsible for the local plaintext computation of the participant.

  • spu:SPUdevice, consists of alice and bob, responsible for the ciphertext calculation of the two parties.

[4]:
alice, bob = sf.PYU('alice'), sf.PYU('bob')
spu = sf.SPU(sf.utils.testing.cluster_def(['alice', 'bob']))

Single-column PSI#

Next, we use uid to intersect the two data, SPU provide psi_csv which take the csv file as input and generate the csv file after the intersection. The default protocol is KKRT.

[5]:
input_path = {alice: '.data/alice.csv', bob: '.data/bob.csv'}
output_path = {alice: '.data/alice_psi.csv', bob: '.data/bob_psi.csv'}
spu.psi_csv('uid', input_path, output_path, 'alice')

To check the correctness of the results, we use pandas.DataFrame.join to inner join da and db. It can be seen that the two data have been aligned according to uid and sorted according to their lexicographical order.

[6]:
import pandas as pd

df = da.join(db.set_index('uid'), on='uid', how='inner', rsuffix='_bob', sort=True)
expected = df[da.columns].astype({'uid': 'int64'}).reset_index(drop=True)

da_psi = pd.read_csv('.data/alice_psi.csv')
db_psi = pd.read_csv('.data/bob_psi.csv')

pd.testing.assert_frame_equal(da_psi, expected)
pd.testing.assert_frame_equal(db_psi, expected)

da_psi
[6]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) uid month
0 6.3 3.3 6.0 2.5 100 Feb
1 5.8 2.7 5.1 1.9 101 Feb
2 7.1 3.0 5.9 2.1 102 Feb
3 6.3 2.9 5.6 1.8 103 Feb
4 6.5 3.0 5.8 2.2 104 Feb
... ... ... ... ... ... ...
101 5.6 2.7 4.2 1.3 94 Feb
102 5.7 2.9 4.2 1.3 96 Feb
103 6.2 2.9 4.3 1.3 97 Feb
104 5.1 2.5 3.0 1.1 98 Feb
105 5.7 2.8 4.1 1.3 99 Feb

106 rows × 6 columns

Multi-columns PSI#

We can also use multiple fields to intersect, the following demonstrates the use of uid and month to intersect two data. In terms of implementation, multiple fields are concatenated into a string, so please ensure that there is no duplication of the multi-column composite primary key.

[7]:
spu.psi_csv(['uid', 'month'], input_path, output_path, 'alice')

Similarly, we use pandas.DataFrame.join to verify the correctness of the result, we can see that the two data have been aligned according to uid and month, and sorted according to their lexicographical order.

[8]:
df = da.join(db.set_index(['uid', 'month']), on=['uid', 'month'], how='inner', rsuffix='_bob', sort=True)
expected = df[da.columns].astype({'uid': 'int64'}).reset_index(drop=True)

da_psi = pd.read_csv('.data/alice_psi.csv')
db_psi = pd.read_csv('.data/bob_psi.csv')

pd.testing.assert_frame_equal(da_psi, expected)
pd.testing.assert_frame_equal(db_psi, expected)

Three parties PSI#

Next, we add a third-party carol, and create a PYU device for it, as well as a SPU device built by the third party.

[9]:
carol = sf.PYU('carol')
spu_3pc = sf.SPU(sf.utils.testing.cluster_def(['alice', 'bob', 'carol']))

Then, use uid and month as the composite primary key to perform a three-way negotiation. It should be noted that the three-way negotiation only supports the ECDH protocol for the time being.

[10]:
input_path = {alice: '.data/alice.csv', bob: '.data/bob.csv', carol: '.data/carol.csv'}
output_path = {alice: '.data/alice_psi.csv', bob: '.data/bob_psi.csv', carol: '.data/carol_psi.csv'}
spu_3pc.psi_csv(['uid', 'month'], input_path, output_path, 'alice', protocol='ECDH_PSI_3PC')

Similarly, we use pandas.DataFrame.join to verify the correctness of the result.

[11]:
keys = ['uid', 'month']
df = da.join(db.set_index(keys), on=keys, how='inner', rsuffix='_bob', sort=True).join(
    dc.set_index(keys), on=keys, how='inner', rsuffix='_carol', sort=True)
expected = df[da.columns].astype({'uid': 'int64'}).reset_index(drop=True)

da_psi = pd.read_csv('.data/alice_psi.csv')
db_psi = pd.read_csv('.data/bob_psi.csv')
dc_psi = pd.read_csv('.data/carol_psi.csv')

pd.testing.assert_frame_equal(da_psi, expected)
pd.testing.assert_frame_equal(db_psi, expected)
pd.testing.assert_frame_equal(dc_psi, expected)

What’s Next#

OK! Through this tutorial, we have seen how to do two-party and three-party data intersections via SPU. After completing the data intersection, we can perform machine learning modeling on the aligned dataset.