在 SecretFlow 中加载 Numpy 数据#

下述代码仅作为展示用例。考虑到系统安全因素，其 不适用于生产环境 。请不要在生产实践中直接使用。

本教程将展示如何在 SecretFlow 的多方安全环境中加载 Numpy 数据。

SecretFlow 支持 npy, npz 等多种格式，接口封装和 numpy 保持一致。

环境设置#

[1]:

%load_ext autoreload
%autoreload 2

[2]:

import secretflow as sf

# Check the version of your SecretFlow
print('The version of SecretFlow: {}'.format(sf.__version__))

# In case you have a running secretflow runtime already.
sf.shutdown()
sf.init(['alice', 'bob', 'charlie'], address="local", log_to_driver=True)
alice, bob, charlie = sf.PYU('alice'), sf.PYU('bob'), sf.PYU('charlie')

2023-04-17 16:47:03,538 INFO worker.py:1538 -- Started a local Ray instance.

接口介绍#

我们在 SecretFlow 中提供了类似于 numpy.load 的接口 secretflow.data.ndarray.load 来将各方数据的 ndarray 读取成为一个联邦概念的数据。

通过 secretflow.data.load 可以读取多方的 Numpy 文件，构成一个 FedNdarray 。

接口介绍：secretflow.data.load

数据下载与分割#

[3]:

%%capture
%%!
wget https://secretflow-data.oss-accelerate.aliyuncs.com/datasets/mnist/mnist.npz
pip install opencv-python

E0417 16:47:08.224327619   13264 fork_posix.cc:76]           Other threads are currently calling into gRPC, skipping fork() handlers

[4]:

import numpy as np

all_data = np.load("./mnist.npz")

对数据进行拆分

[5]:

alice_train_x = all_data["x_train"][:30000]
alice_test_x = all_data["x_test"][:30000]
alice_train_y = all_data["y_train"][:30000]
alice_test_y = all_data["y_test"][:30000]

bob_train_x = all_data["x_train"][30000:]
bob_test_x = all_data["x_test"][30000:]
bob_train_y = all_data["y_train"][30000:]
bob_test_y = all_data["y_test"][30000:]

分别保存成 npz 格式文件用

[6]:

np.savez(
    "./alice_mnist.npz",
    train_x=alice_train_x,
    test_x=alice_test_x,
    train_y=alice_train_y,
    test_y=alice_test_y,
)
np.savez(
    "./bob_mnist.npz",
    train_x=bob_train_x,
    test_x=bob_test_x,
    train_y=bob_train_y,
    test_y=bob_test_y,
)

将 Alice 和 Bob 的 train_x 保存成 npy 格式，方便后文 npy 格式读取使用。

[7]:

np.save("./alice_mnist_train_x.npy", alice_train_x)
np.save("./bob_mnist_train_x.npy", bob_train_x)

加载 npz 文件#

[8]:

alice_path = "./alice_mnist.npz"
bob_path = "./bob_mnist.npz"

[9]:

from secretflow.data.ndarray import load
from secretflow.data.split import train_test_split

[10]:

fed_npz = load({alice: alice_path, bob: bob_path}, allow_pickle=True)

[11]:

fed_npz

[11]:

{'train_x': FedNdarray(partitions={alice: <secretflow.device.device.pyu.PYUObject object at 0x7fec825fba90>, bob: <secretflow.device.device.pyu.PYUObject object at 0x7fec825fb580>}, partition_way=<PartitionWay.VERTICAL: 'vertical'>),
 'test_x': FedNdarray(partitions={alice: <secretflow.device.device.pyu.PYUObject object at 0x7fec825fbdc0>, bob: <secretflow.device.device.pyu.PYUObject object at 0x7fec82570070>}, partition_way=<PartitionWay.VERTICAL: 'vertical'>),
 'train_y': FedNdarray(partitions={alice: <secretflow.device.device.pyu.PYUObject object at 0x7fec82570190>, bob: <secretflow.device.device.pyu.PYUObject object at 0x7fec825704f0>}, partition_way=<PartitionWay.VERTICAL: 'vertical'>),
 'test_y': FedNdarray(partitions={alice: <secretflow.device.device.pyu.PYUObject object at 0x7fec82570280>, bob: <secretflow.device.device.pyu.PYUObject object at 0x7fec82570850>}, partition_way=<PartitionWay.VERTICAL: 'vertical'>)}

FedNpz 的每一个 value 是 FedNdarray。

[12]:

type(fed_npz["train_x"])

[12]:

secretflow.data.ndarray.FedNdarray

加载 npy 文件#

加载 npy 就很简单了，直接调用 load 接口读取出来就是一个标准的 FedNdarray 对象。

[13]:

alice_path = "./alice_mnist_train_x.npy"
bob_path = "./bob_mnist_train_x.npy"

[14]:

fed_ndarray = load({alice: alice_path, bob: bob_path}, allow_pickle=True)

[15]:

type(fed_ndarray)

[15]:

secretflow.data.ndarray.FedNdarray

应该怎样将我已有的数据转成 FedNdarray 再进行读取？#

那我们怎么样将其他类型的数据转成 FedNdarray 数据呢？
比如我有一个图像数据集或者语音数据集，我该怎么样通过 FedNdarray 将数据传入联邦模型？
我们这里以花卉分类数据集 Flower 来举个例子。

[16]:

import tempfile
import tensorflow as tf

_temp_dir = tempfile.mkdtemp()
path_to_flower_dataset = tf.keras.utils.get_file(
    "flower_photos",
    "https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz",
    untar=True,
    cache_dir=_temp_dir,
)

2023-04-17 11:41:31.422425: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/rh/devtoolset-11/root/usr/lib64:/opt/rh/devtoolset-11/root/usr/lib:/opt/rh/devtoolset-11/root/usr/lib64/dyninst:/opt/rh/devtoolset-11/root/usr/lib/dyninst
2023-04-17 11:41:31.456434: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-04-17 11:41:32.352006: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/rh/devtoolset-11/root/usr/lib64:/opt/rh/devtoolset-11/root/usr/lib:/opt/rh/devtoolset-11/root/usr/lib64/dyninst:/opt/rh/devtoolset-11/root/usr/lib/dyninst
2023-04-17 11:41:32.352101: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/rh/devtoolset-11/root/usr/lib64:/opt/rh/devtoolset-11/root/usr/lib:/opt/rh/devtoolset-11/root/usr/lib64/dyninst:/opt/rh/devtoolset-11/root/usr/lib/dyninst
2023-04-17 11:41:32.352111: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.

Downloading data from https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz
228813984/228813984 [==============================] - 2s 0us/step

下载解压后根目录存在 Root = “flower_photos”

[17]:

import os, glob
import numpy as np
import cv2  # The dependencies need to be installed manually, pip install opencv-python

root = path_to_flower_dataset
classes = ['daisy', 'dandelion', 'roses', 'sunflowers', 'tulips']
img_paths = []  # Used to save all picture paths
labels = []  # Used to save the picture category tags,(0,1,2,3,4)
for i, label in enumerate(classes):
    cls_img_paths = glob.glob(os.path.join(root, label, "*.jpg"))
    img_paths.extend(cls_img_paths)
    labels.extend([i] * len(cls_img_paths))

# image->numpy
img_numpys = []
labels = np.array(labels)
for img_path in img_paths:
    img_numpy = cv2.imread(img_path)
    img_numpy = cv2.resize(img_numpy, (240, 240))
    img_numpy = np.reshape(img_numpy, (1, 240, 240, 3))
    # If use Pytorch backend dimension should be exchanged
    # img_numpy = np.transpose(img_numpy, (0,3,1,2))
    img_numpys.append(img_numpy)

images = np.concatenate(img_numpys, axis=0)
print(images.shape)
print(labels.shape)

# Distribute images and labels to two nodes, allocating 50% of the data to each node.
per = 0.5
alice_images = images[: int(per * images.shape[0]), :, :, :]
alice_label = labels[: int(per * images.shape[0])]
bob_images = images[int(per * images.shape[0]) :, :, :, :]
bob_label = labels[int(per * images.shape[0]) :]
print(
    f"alice images shape = {alice_images.shape}, alice labels shape = {alice_label.shape}"
)
print(f"bob images shape = {bob_images.shape}, bob labels shape = {bob_label.shape}")

# Save the data as npz files separately, and then send them to the two machines.
np.savez("flower_alice.npz", image=alice_images, label=alice_label)
np.savez("flower_bob.npz", image=bob_images, label=bob_label)

(3670, 240, 240, 3)
(3670,)
alice images shape = (1835, 240, 240, 3), alice labels shape = (1835,)
bob images shape = (1835, 240, 240, 3), bob labels shape = (1835,)

得到需要的 NPZ 之后，使用上面介绍过的 Load 函数读取成 FedNdarray，输入到模型中即可开始训练。

[18]:

fed_flower_npz = load(
    {alice: "./flower_alice.npz", bob: "./flower_bob.npz"}, allow_pickle=True
)

[19]:

fed_flower_npz

[19]:

{'image': FedNdarray(partitions={alice: <secretflow.device.device.pyu.PYUObject object at 0x7feccc781d90>, bob: <secretflow.device.device.pyu.PYUObject object at 0x7febc2cea640>}, partition_way=<PartitionWay.VERTICAL: 'vertical'>),
 'label': FedNdarray(partitions={alice: <secretflow.device.device.pyu.PYUObject object at 0x7febc2cea790>, bob: <secretflow.device.device.pyu.PYUObject object at 0x7febc2ceaa30>}, partition_way=<PartitionWay.VERTICAL: 'vertical'>)}

[20]:

fed_image = fed_flower_npz["image"]

[21]:

fed_image.partition_shape()

[21]:

{alice: (1835, 240, 240, 3), bob: (1835, 240, 240, 3)}

小建议#

建议将数据转为 ndarray 类型之后，使用单机版训练引擎进行测试，检查数据格式是否正确匹配模型。然后再使用 SecretFlow(隐语) 的联邦框架进行测试，可以提高排查效率。

注意：在使用图像数据集是要注意维度排列。