Load Numpy data in SecretFlow#

The following code is for demonstration only. For system security reasons, do NOT use it directly in production.

This tutorial demonstrates how to load NumPy data in a multi-party secure environment using SecretFlow.
SecretFlow supports multiple file formats, including .npy and .npz, and its loading interface is designed to be compatible with NumPy's.

Environment Configuration#

[1]:
%load_ext autoreload
%autoreload 2
[2]:
import secretflow as sf

# Check the version of your SecretFlow
print('The version of SecretFlow: {}'.format(sf.__version__))

# In case you have a running secretflow runtime already.
sf.shutdown()
sf.init(['alice', 'bob', 'charlie'], address="local", log_to_driver=True)
alice, bob, charlie = sf.PYU('alice'), sf.PYU('bob'), sf.PYU('charlie')
2023-04-17 16:47:03,538 INFO worker.py:1538 -- Started a local Ray instance.

Interface Introduction#

SecretFlow provides an interface similar to numpy.load, named secretflow.data.ndarray.load, which loads ndarray data from multiple parties and converts it into a federated representation.

Using secretflow.data.ndarray.load, you can read NumPy files held by multiple parties and create a FedNdarray object.

Interface reference: secretflow.data.ndarray.load
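
Since the interface mirrors numpy.load, it may help to recall how numpy.load itself treats the two file formats. The sketch below uses plain NumPy only, with small synthetic arrays and temporary file names chosen for illustration. A .npy file comes back as a single ndarray, while a .npz file comes back as a dict-like archive keyed by array name. The SecretFlow loader shown later in this tutorial follows the same pattern, returning a FedNdarray for .npy inputs and a dictionary of FedNdarray objects for .npz inputs.

```python
import os
import tempfile

import numpy as np

# Small synthetic arrays stand in for real data.
features = np.arange(12, dtype=np.float32).reshape(4, 3)
labels = np.array([0, 1, 0, 1])

tmp_dir = tempfile.mkdtemp()
npy_path = os.path.join(tmp_dir, "features.npy")
npz_path = os.path.join(tmp_dir, "dataset.npz")

np.save(npy_path, features)               # one array per .npy file
np.savez(npz_path, x=features, y=labels)  # several named arrays per .npz file

# A .npy file loads back as a plain ndarray.
loaded_array = np.load(npy_path)

# A .npz file loads as a dict-like NpzFile keyed by the names passed to savez.
with np.load(npz_path) as archive:
    archive_keys = sorted(archive.files)
    loaded_x = archive["x"]

print(type(loaded_array).__name__, loaded_array.shape)
print(archive_keys)
```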

Data Download and Splitting#

[3]:
%%capture
%%!
wget https://secretflow-data.oss-accelerate.aliyuncs.com/datasets/mnist/mnist.npz
pip install opencv-python
E0417 16:47:08.224327619   13264 fork_posix.cc:76]           Other threads are currently calling into gRPC, skipping fork() handlers
[4]:
import numpy as np

all_data = np.load("./mnist.npz")

Split the data between Alice and Bob.

[5]:
# MNIST has 60000 training samples and 10000 test samples,
# so each set is split at its own midpoint.
alice_train_x = all_data["x_train"][:30000]
alice_test_x = all_data["x_test"][:5000]
alice_train_y = all_data["y_train"][:30000]
alice_test_y = all_data["y_test"][:5000]

bob_train_x = all_data["x_train"][30000:]
bob_test_x = all_data["x_test"][5000:]
bob_train_y = all_data["y_train"][30000:]
bob_test_y = all_data["y_test"][5000:]
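
MNIST's training and test sets differ in size (60000 and 10000 samples), so a split point that suits one set does not necessarily suit the other. As a minimal sketch, the hypothetical helper `split_two_parties` below (not part of SecretFlow) derives the split point from the array length instead of a hard-coded index. Synthetic label vectors stand in for the real data.

```python
import numpy as np


def split_two_parties(array, fraction=0.5):
    """Split `array` along axis 0; the first party gets `fraction` of the rows."""
    cut = int(len(array) * fraction)
    return array[:cut], array[cut:]


# Synthetic label vectors with MNIST's sample counts (60000 train, 10000 test).
y_train = np.zeros(60000, dtype=np.uint8)
y_test = np.zeros(10000, dtype=np.uint8)

alice_y_train, bob_y_train = split_two_parties(y_train)
alice_y_test, bob_y_test = split_two_parties(y_test)

print(alice_y_train.shape, bob_y_train.shape)
print(alice_y_test.shape, bob_y_test.shape)

# Slicing past the end of an array does not raise an error; it silently
# returns an empty array, so a fixed index larger than the array length
# would leave one party with no samples.
empty_tail = y_test[30000:]
print(empty_tail.shape)
```

Printing the resulting shapes for every party before saving is a cheap way to catch an empty partition early.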

Save each party's data as a separate .npz file.

[6]:
np.savez(
    "./alice_mnist.npz",
    train_x=alice_train_x,
    test_x=alice_test_x,
    train_y=alice_train_y,
    test_y=alice_test_y,
)
np.savez(
    "./bob_mnist.npz",
    train_x=bob_train_x,
    test_x=bob_test_x,
    train_y=bob_train_y,
    test_y=bob_test_y,
)

Save train_x from Alice and Bob in .npy format as well, so that it can be read conveniently later.

[7]:
np.save("./alice_mnist_train_x.npy", alice_train_x)
np.save("./bob_mnist_train_x.npy", bob_train_x)

Loading npz files#

[8]:
alice_path = "./alice_mnist.npz"
bob_path = "./bob_mnist.npz"
[9]:
from secretflow.data.ndarray import load
from secretflow.data.split import train_test_split
[10]:
fed_npz = load({alice: alice_path, bob: bob_path}, allow_pickle=True)
[11]:
fed_npz
[11]:
{'train_x': FedNdarray(partitions={alice: <secretflow.device.device.pyu.PYUObject object at 0x7fec825fba90>, bob: <secretflow.device.device.pyu.PYUObject object at 0x7fec825fb580>}, partition_way=<PartitionWay.VERTICAL: 'vertical'>),
 'test_x': FedNdarray(partitions={alice: <secretflow.device.device.pyu.PYUObject object at 0x7fec825fbdc0>, bob: <secretflow.device.device.pyu.PYUObject object at 0x7fec82570070>}, partition_way=<PartitionWay.VERTICAL: 'vertical'>),
 'train_y': FedNdarray(partitions={alice: <secretflow.device.device.pyu.PYUObject object at 0x7fec82570190>, bob: <secretflow.device.device.pyu.PYUObject object at 0x7fec825704f0>}, partition_way=<PartitionWay.VERTICAL: 'vertical'>),
 'test_y': FedNdarray(partitions={alice: <secretflow.device.device.pyu.PYUObject object at 0x7fec82570280>, bob: <secretflow.device.device.pyu.PYUObject object at 0x7fec82570850>}, partition_way=<PartitionWay.VERTICAL: 'vertical'>)}

The result of loading .npz files is a dictionary keyed by the array names stored in the files; each value is a FedNdarray.

[12]:
type(fed_npz["train_x"])
[12]:
secretflow.data.ndarray.FedNdarray

Loading npy files#

Loading .npy files is just as simple: call the load interface directly, and the result is a standard FedNdarray object.

[13]:
alice_path = "./alice_mnist_train_x.npy"
bob_path = "./bob_mnist_train_x.npy"
[14]:
fed_ndarray = load({alice: alice_path, bob: bob_path}, allow_pickle=True)
[15]:
type(fed_ndarray)
[15]:
secretflow.data.ndarray.FedNdarray

How can I convert my existing data into a FedNdarray and read it?#

How can other types of data be converted into FedNdarray data?
For example, given an image dataset or a speech dataset, how can the data be passed into a federated model using FedNdarray?
Let's take the Flower classification dataset as an example.
[16]:
import tempfile
import tensorflow as tf

_temp_dir = tempfile.mkdtemp()
path_to_flower_dataset = tf.keras.utils.get_file(
    "flower_photos",
    "https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz",
    untar=True,
    cache_dir=_temp_dir,
)
2023-04-17 11:41:31.422425: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/rh/devtoolset-11/root/usr/lib64:/opt/rh/devtoolset-11/root/usr/lib:/opt/rh/devtoolset-11/root/usr/lib64/dyninst:/opt/rh/devtoolset-11/root/usr/lib/dyninst
2023-04-17 11:41:31.456434: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-04-17 11:41:32.352006: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/rh/devtoolset-11/root/usr/lib64:/opt/rh/devtoolset-11/root/usr/lib:/opt/rh/devtoolset-11/root/usr/lib64/dyninst:/opt/rh/devtoolset-11/root/usr/lib/dyninst
2023-04-17 11:41:32.352101: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/rh/devtoolset-11/root/usr/lib64:/opt/rh/devtoolset-11/root/usr/lib:/opt/rh/devtoolset-11/root/usr/lib64/dyninst:/opt/rh/devtoolset-11/root/usr/lib/dyninst
2023-04-17 11:41:32.352111: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Downloading data from https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz
228813984/228813984 [==============================] - 2s 0us/step

After downloading and extracting the dataset, the root directory of the dataset is “flower_photos”.

[17]:
import os, glob
import numpy as np
import cv2  # Must be installed manually: pip install opencv-python

root = path_to_flower_dataset
classes = ['daisy', 'dandelion', 'roses', 'sunflowers', 'tulips']
img_paths = []  # Paths of all images
labels = []  # Class index of each image (0, 1, 2, 3, 4)
for i, label in enumerate(classes):
    cls_img_paths = glob.glob(os.path.join(root, label, "*.jpg"))
    img_paths.extend(cls_img_paths)
    labels.extend([i] * len(cls_img_paths))

# image->numpy
img_numpys = []
labels = np.array(labels)
for img_path in img_paths:
    img_numpy = cv2.imread(img_path)
    img_numpy = cv2.resize(img_numpy, (240, 240))
    img_numpy = np.reshape(img_numpy, (1, 240, 240, 3))
    # For a PyTorch backend, switch to channels-first (N, C, H, W) order:
    # img_numpy = np.transpose(img_numpy, (0,3,1,2))
    img_numpys.append(img_numpy)

images = np.concatenate(img_numpys, axis=0)
print(images.shape)
print(labels.shape)

# Distribute images and labels to two nodes, allocating 50% of the data to each node.
per = 0.5
alice_images = images[: int(per * images.shape[0]), :, :, :]
alice_label = labels[: int(per * images.shape[0])]
bob_images = images[int(per * images.shape[0]) :, :, :, :]
bob_label = labels[int(per * images.shape[0]) :]
print(
    f"alice images shape = {alice_images.shape}, alice labels shape = {alice_label.shape}"
)
print(f"bob images shape = {bob_images.shape}, bob labels shape = {bob_label.shape}")

# Save the data as npz files separately, and then send them to the two machines.
np.savez("flower_alice.npz", image=alice_images, label=alice_label)
np.savez("flower_bob.npz", image=bob_images, label=bob_label)
(3670, 240, 240, 3)
(3670,)
alice images shape = (1835, 240, 240, 3), alice labels shape = (1835,)
bob images shape = (1835, 240, 240, 3), bob labels shape = (1835,)
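
The loop above collects images class by class, so a plain index split at 50% gives Alice mostly the first classes and Bob mostly the last ones. Whether that matters depends on the experiment. If a more balanced split is wanted, one option is to shuffle images and labels with the same permutation before splitting. The sketch below is an assumption about how this could be done, not part of the original tutorial, and uses tiny synthetic arrays in place of the real images.

```python
import numpy as np

# Synthetic stand-ins: 10 "images" of shape (2, 2, 3), labels sorted by class.
images = np.arange(10 * 2 * 2 * 3, dtype=np.uint8).reshape(10, 2, 2, 3)
labels = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])

# A fixed seed keeps the split reproducible across runs.
rng = np.random.default_rng(seed=0)
order = rng.permutation(len(labels))

# Indexing both arrays with the same permutation keeps each image
# paired with its label.
shuffled_images = images[order]
shuffled_labels = labels[order]

half = len(labels) // 2
alice_images, bob_images = shuffled_images[:half], shuffled_images[half:]
alice_label, bob_label = shuffled_labels[:half], shuffled_labels[half:]

print(alice_images.shape, alice_label.shape)
print(bob_images.shape, bob_label.shape)
```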

Once you have the required .npz files, use the load function introduced earlier to read them as FedNdarray objects, which can then be fed into the model for training.

[18]:
fed_flower_npz = load(
    {alice: "./flower_alice.npz", bob: "./flower_bob.npz"}, allow_pickle=True
)
[19]:
fed_flower_npz
[19]:
{'image': FedNdarray(partitions={alice: <secretflow.device.device.pyu.PYUObject object at 0x7feccc781d90>, bob: <secretflow.device.device.pyu.PYUObject object at 0x7febc2cea640>}, partition_way=<PartitionWay.VERTICAL: 'vertical'>),
 'label': FedNdarray(partitions={alice: <secretflow.device.device.pyu.PYUObject object at 0x7febc2cea790>, bob: <secretflow.device.device.pyu.PYUObject object at 0x7febc2ceaa30>}, partition_way=<PartitionWay.VERTICAL: 'vertical'>)}
[20]:
fed_image = fed_flower_npz["image"]
[21]:
fed_image.partition_shape()
[21]:
{alice: (1835, 240, 240, 3), bob: (1835, 240, 240, 3)}
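
The per-party shapes can also be cross-checked locally, before any federated loading, by reading each party's file with plain NumPy. The helpers `summarize_npz` and `check_party_files` below are hypothetical names introduced for this sketch. They check that every file holds the same array names and that, within a file, all arrays have the same number of samples. Tiny synthetic files stand in for flower_alice.npz and flower_bob.npz.

```python
import os
import tempfile

import numpy as np


def summarize_npz(path):
    """Return {array_name: shape} for every array stored in an .npz file."""
    with np.load(path) as archive:
        return {name: archive[name].shape for name in archive.files}


def check_party_files(paths):
    """Check that all party files hold the same array names and that, within
    each file, every array has the same number of samples (axis 0)."""
    summaries = {party: summarize_npz(path) for party, path in paths.items()}
    key_sets = {frozenset(summary) for summary in summaries.values()}
    if len(key_sets) != 1:
        raise ValueError(f"array names differ between parties: {summaries}")
    for party, summary in summaries.items():
        sample_counts = {shape[0] for shape in summary.values()}
        if len(sample_counts) != 1:
            raise ValueError(f"{party}: sample counts differ: {summary}")
    return summaries


# Write one small synthetic file per party.
tmp_dir = tempfile.mkdtemp()
paths = {}
for party, n_samples in [("alice", 3), ("bob", 2)]:
    paths[party] = os.path.join(tmp_dir, f"flower_{party}.npz")
    np.savez(
        paths[party],
        image=np.zeros((n_samples, 8, 8, 3), dtype=np.uint8),
        label=np.zeros(n_samples, dtype=np.int64),
    )

summaries = check_party_files(paths)
print(summaries)
```

In a real deployment each party would run such a check on its own machine, since the files are not supposed to leave their owners.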

Tips#

After converting data to ndarray, it is recommended to first test it with a single-machine training engine to verify that the data format matches the model. Then test it with the SecretFlow federated framework; this order makes troubleshooting more efficient.
Note: When using image datasets, pay attention to the dimension ordering (for example, channels-last versus channels-first).
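
As an illustration of the dimension-ordering note, the sketch below converts a synthetic batch between the channels-last layout produced by the image loop in this tutorial, (N, H, W, C), and the channels-first layout, (N, C, H, W), mentioned there for a PyTorch backend. It also shows a channel flip, because cv2.imread returns channels in BGR order. Whether a given model needs either conversion is an assumption to verify against that model's input specification.

```python
import numpy as np

# Synthetic batch in channels-last layout: (batch, height, width, channels).
batch_nhwc = np.zeros((4, 240, 240, 3), dtype=np.uint8)
batch_nhwc[..., 0] = 10  # first channel (blue, if the data came from cv2.imread)
batch_nhwc[..., 2] = 30  # last channel (red, if the data came from cv2.imread)

# Channels-first layout: (batch, channels, height, width).
batch_nchw = np.transpose(batch_nhwc, (0, 3, 1, 2))

# The inverse permutation restores the original layout.
restored = np.transpose(batch_nchw, (0, 2, 3, 1))

# np.transpose returns a view; np.ascontiguousarray makes a contiguous copy,
# which some libraries expect.
batch_nchw = np.ascontiguousarray(batch_nchw)

# Reversing the last axis swaps BGR to RGB (or back).
batch_rgb = batch_nhwc[..., ::-1]

print(batch_nhwc.shape, batch_nchw.shape)
```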