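Loading Numpy data in SecretFlow#
The following code is for demonstration only. For security reasons, it is not suitable for production environments. Do not use it directly in production.
This tutorial shows how to load Numpy data in SecretFlow's multi-party secure environment.
SecretFlow supports several formats, including npy and npz, and its interface mirrors numpy's.
Environment setup#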
[1]:
%load_ext autoreload
%autoreload 2
[2]:
import secretflow as sf
# Check the version of your SecretFlow
print('The version of SecretFlow: {}'.format(sf.__version__))
# In case you have a running secretflow runtime already.
sf.shutdown()
sf.init(['alice', 'bob', 'charlie'], address="local", log_to_driver=True)
alice, bob, charlie = sf.PYU('alice'), sf.PYU('bob'), sf.PYU('charlie')
2023-04-17 16:47:03,538 INFO worker.py:1538 -- Started a local Ray instance.
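Interface overview#
SecretFlow provides secretflow.data.ndarray.load, an interface similar to numpy.load, which reads each party's ndarray into a single federated data object.
Calling secretflow.data.ndarray.load on the Numpy files of several parties produces a FedNdarray (or, for npz files, a dict of FedNdarray objects, as the outputs later in this tutorial show).
API reference: secretflow.data.ndarray.load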
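Because the interface mirrors numpy, it may help to recall how plain numpy.load treats the two formats. The sketch below uses only numpy and a throwaway temporary directory (the file names are made up for illustration): an npy file yields a single array, while an npz file yields a dict-like object keyed by array name.

```python
import os
import tempfile

import numpy as np

# Hypothetical file names inside a throwaway directory.
tmp_dir = tempfile.mkdtemp()
npy_path = os.path.join(tmp_dir, "demo.npy")
npz_path = os.path.join(tmp_dir, "demo.npz")

x = np.arange(6).reshape(2, 3)
y = np.array([0, 1])

np.save(npy_path, x)          # one array per .npy file
np.savez(npz_path, x=x, y=y)  # several named arrays per .npz file

# .npy -> a plain ndarray
loaded_npy = np.load(npy_path)

# .npz -> a dict-like NpzFile; arrays are read when indexed by key
with np.load(npz_path) as loaded_npz:
    npz_keys = sorted(loaded_npz.files)
    npz_x = loaded_npz["x"]

print(type(loaded_npy).__name__)
print(npz_keys)
```

The federated load keeps this distinction: one FedNdarray for npy input, one FedNdarray per key for npz input.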
Downloading and splitting the data#
[3]:
%%capture
%%!
wget https://secretflow-data.oss-accelerate.aliyuncs.com/datasets/mnist/mnist.npz
pip install opencv-python
[4]:
import numpy as np
all_data = np.load("./mnist.npz")
Split the data between the two parties.
[5]:
# x_train/y_train hold 60,000 samples; x_test/y_test hold 10,000,
# so the test arrays are split at 5,000 to give Bob a non-empty share.
alice_train_x = all_data["x_train"][:30000]
alice_test_x = all_data["x_test"][:5000]
alice_train_y = all_data["y_train"][:30000]
alice_test_y = all_data["y_test"][:5000]
bob_train_x = all_data["x_train"][30000:]
bob_test_x = all_data["x_test"][5000:]
bob_train_y = all_data["y_train"][30000:]
bob_test_y = all_data["y_test"][5000:]
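Since the training and test arrays have different lengths (60,000 and 10,000 samples for MNIST), a split point derived from each array's own length may be more robust than a hard-coded index. The following is a sketch on scaled-down synthetic arrays; split_in_half is a hypothetical helper, not part of numpy or SecretFlow.

```python
import numpy as np


def split_in_half(arr):
    """Return (first_half, second_half), split along the first axis."""
    mid = arr.shape[0] // 2
    return arr[:mid], arr[mid:]


# Synthetic stand-ins with MNIST-like image shape, scaled down 1000x.
demo_x_train = np.zeros((60, 28, 28), dtype=np.uint8)
demo_x_test = np.zeros((10, 28, 28), dtype=np.uint8)

alice_demo_train, bob_demo_train = split_in_half(demo_x_train)
alice_demo_test, bob_demo_test = split_in_half(demo_x_test)

print(alice_demo_train.shape, bob_demo_train.shape)
print(alice_demo_test.shape, bob_demo_test.shape)
```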
Save each party's arrays to its own npz file.
[6]:
np.savez(
    "./alice_mnist.npz",
    train_x=alice_train_x,
    test_x=alice_test_x,
    train_y=alice_train_y,
    test_y=alice_test_y,
)
np.savez(
    "./bob_mnist.npz",
    train_x=bob_train_x,
    test_x=bob_test_x,
    train_y=bob_train_y,
    test_y=bob_test_y,
)
Also save Alice's and Bob's train_x arrays in npy format, for use in the npy loading section later.
[7]:
np.save("./alice_mnist_train_x.npy", alice_train_x)
np.save("./bob_mnist_train_x.npy", bob_train_x)
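Before handing each party its file, a quick look at shape and dtype can catch a bad save. The sketch below uses a temporary file and a synthetic array in place of the MNIST slices (the path and array are assumptions for illustration). Passing mmap_mode="r" to numpy.load maps the file instead of reading it fully into memory, which is convenient for large arrays.

```python
import os
import tempfile

import numpy as np

# Hypothetical path in a throwaway directory.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "party_train_x.npy")

# Synthetic uint8 "images" standing in for a party's train_x.
rng = np.random.default_rng(0)
original = rng.integers(0, 256, size=(100, 28, 28), dtype=np.uint8)
np.save(path, original)

# Memory-map the saved file and inspect it without a full read.
mapped = np.load(path, mmap_mode="r")
print(mapped.shape, mapped.dtype)

# Spot-check that the first few samples round-tripped unchanged.
first_rows_match = bool(np.array_equal(mapped[:10], original[:10]))
print(first_rows_match)
```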
Loading npz files#
[8]:
alice_path = "./alice_mnist.npz"
bob_path = "./bob_mnist.npz"
[9]:
from secretflow.data.ndarray import load
from secretflow.data.split import train_test_split
[10]:
fed_npz = load({alice: alice_path, bob: bob_path}, allow_pickle=True)
[11]:
fed_npz
[11]:
{'train_x': FedNdarray(partitions={alice: <secretflow.device.device.pyu.PYUObject object at 0x7fec825fba90>, bob: <secretflow.device.device.pyu.PYUObject object at 0x7fec825fb580>}, partition_way=<PartitionWay.VERTICAL: 'vertical'>),
'test_x': FedNdarray(partitions={alice: <secretflow.device.device.pyu.PYUObject object at 0x7fec825fbdc0>, bob: <secretflow.device.device.pyu.PYUObject object at 0x7fec82570070>}, partition_way=<PartitionWay.VERTICAL: 'vertical'>),
'train_y': FedNdarray(partitions={alice: <secretflow.device.device.pyu.PYUObject object at 0x7fec82570190>, bob: <secretflow.device.device.pyu.PYUObject object at 0x7fec825704f0>}, partition_way=<PartitionWay.VERTICAL: 'vertical'>),
'test_y': FedNdarray(partitions={alice: <secretflow.device.device.pyu.PYUObject object at 0x7fec82570280>, bob: <secretflow.device.device.pyu.PYUObject object at 0x7fec82570850>}, partition_way=<PartitionWay.VERTICAL: 'vertical'>)}
Each value in the loaded dict (fed_npz) is a FedNdarray.
[12]:
type(fed_npz["train_x"])
[12]:
secretflow.data.ndarray.FedNdarray
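One way to picture the output above: each party's npz file maps keys to arrays, and the federated load regroups them into one entry per key, each holding every party's piece. The following is a plain-numpy analogy only, with hypothetical names; it does not use or reproduce SecretFlow's implementation, and real FedNdarray partitions stay on their owning devices rather than in a local dict.

```python
import numpy as np

# Stand-ins for the contents of each party's npz file.
party_files = {
    "alice": {"train_x": np.zeros((3, 4)), "train_y": np.zeros(3)},
    "bob": {"train_x": np.ones((5, 4)), "train_y": np.ones(5)},
}


def regroup_by_key(files):
    """Turn {party: {key: array}} into {key: {party: array}}."""
    key_sets = [set(arrays) for arrays in files.values()]
    if any(keys != key_sets[0] for keys in key_sets):
        raise ValueError("all parties must provide the same keys")
    return {
        key: {party: arrays[key] for party, arrays in files.items()}
        for key in sorted(key_sets[0])
    }


grouped = regroup_by_key(party_files)
train_x_shapes = {party: arr.shape for party, arr in grouped["train_x"].items()}
print(sorted(grouped))
print(train_x_shapes)
```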
Loading npy files#
Loading npy files is simpler: calling load directly returns a standard FedNdarray object.
[13]:
alice_path = "./alice_mnist_train_x.npy"
bob_path = "./bob_mnist_train_x.npy"
[14]:
fed_ndarray = load({alice: alice_path, bob: bob_path}, allow_pickle=True)
[15]:
type(fed_ndarray)
[15]:
secretflow.data.ndarray.FedNdarray
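How do I convert my existing data into a FedNdarray?#
How can other kinds of data be converted into a FedNdarray?
For example, given an image or audio dataset, how can it be passed to a federated model through FedNdarray?
We use the Flower classification dataset as an example.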
[16]:
import tempfile
import tensorflow as tf
_temp_dir = tempfile.mkdtemp()
path_to_flower_dataset = tf.keras.utils.get_file(
    "flower_photos",
    "https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz",
    untar=True,
    cache_dir=_temp_dir,
)
Downloading data from https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz
228813984/228813984 [==============================] - 2s 0us/step
After download and extraction, the dataset root directory is "flower_photos".
[17]:
import os, glob
import numpy as np
import cv2  # Must be installed manually: pip install opencv-python
root = path_to_flower_dataset
classes = ['daisy', 'dandelion', 'roses', 'sunflowers', 'tulips']
img_paths = []  # Paths of all images
labels = []  # Class index of each image (0, 1, 2, 3, 4)
for i, label in enumerate(classes):
    cls_img_paths = glob.glob(os.path.join(root, label, "*.jpg"))
    img_paths.extend(cls_img_paths)
    labels.extend([i] * len(cls_img_paths))
# image->numpy
img_numpys = []
labels = np.array(labels)
for img_path in img_paths:
    img_numpy = cv2.imread(img_path)
    img_numpy = cv2.resize(img_numpy, (240, 240))
    img_numpy = np.reshape(img_numpy, (1, 240, 240, 3))
    # With a PyTorch backend, move the channel axis to position 1:
    # img_numpy = np.transpose(img_numpy, (0,3,1,2))
    img_numpys.append(img_numpy)
images = np.concatenate(img_numpys, axis=0)
print(images.shape)
print(labels.shape)
# Distribute images and labels to two nodes, allocating 50% of the data to each node.
per = 0.5
alice_images = images[: int(per * images.shape[0]), :, :, :]
alice_label = labels[: int(per * images.shape[0])]
bob_images = images[int(per * images.shape[0]) :, :, :, :]
bob_label = labels[int(per * images.shape[0]) :]
print(
    f"alice images shape = {alice_images.shape}, alice labels shape = {alice_label.shape}"
)
print(f"bob images shape = {bob_images.shape}, bob labels shape = {bob_label.shape}")
# Save the data as npz files separately, and then send them to the two machines.
np.savez("flower_alice.npz", image=alice_images, label=alice_label)
np.savez("flower_bob.npz", image=bob_images, label=bob_label)
(3670, 240, 240, 3)
(3670,)
alice images shape = (1835, 240, 240, 3), alice labels shape = (1835,)
bob images shape = (1835, 240, 240, 3), bob labels shape = (1835,)
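The stacking and splitting steps above can be condensed into one function. This is a sketch on small synthetic images (no OpenCV needed here); stack_and_split and its parameters are hypothetical. np.stack adds the batch axis directly, which replaces the reshape-then-concatenate pair.

```python
import numpy as np


def stack_and_split(image_list, labels, fraction=0.5, seed=None):
    """Stack equally sized images and split them between two parties.

    If seed is given, images and labels are shuffled with the same
    permutation before the split.
    """
    images = np.stack(image_list, axis=0)  # (N, H, W, C)
    labels = np.asarray(labels)
    if images.shape[0] != labels.shape[0]:
        raise ValueError("images and labels must have the same length")
    if seed is not None:
        order = np.random.default_rng(seed).permutation(images.shape[0])
        images, labels = images[order], labels[order]
    cut = int(fraction * images.shape[0])
    return (images[:cut], labels[:cut]), (images[cut:], labels[cut:])


# Ten fake 8x8 RGB images; image i is filled with value i, label is i % 5.
fake_images = [np.full((8, 8, 3), i, dtype=np.uint8) for i in range(10)]
fake_labels = [i % 5 for i in range(10)]

(a_img, a_lab), (b_img, b_lab) = stack_and_split(fake_images, fake_labels)
print(a_img.shape, a_lab.shape)
print(b_img.shape, b_lab.shape)
```

Since the image paths above are gathered class by class, a contiguous 50% cut gives the two parties different class mixes. Passing a seed shuffles first, which may or may not suit a given experiment.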
Once you have the npz files, read them into FedNdarray objects with the load function introduced above, then feed them to the model to start training.
[18]:
fed_flower_npz = load(
    {alice: "./flower_alice.npz", bob: "./flower_bob.npz"}, allow_pickle=True
)
[19]:
fed_flower_npz
[19]:
{'image': FedNdarray(partitions={alice: <secretflow.device.device.pyu.PYUObject object at 0x7feccc781d90>, bob: <secretflow.device.device.pyu.PYUObject object at 0x7febc2cea640>}, partition_way=<PartitionWay.VERTICAL: 'vertical'>),
'label': FedNdarray(partitions={alice: <secretflow.device.device.pyu.PYUObject object at 0x7febc2cea790>, bob: <secretflow.device.device.pyu.PYUObject object at 0x7febc2ceaa30>}, partition_way=<PartitionWay.VERTICAL: 'vertical'>)}
[20]:
fed_image = fed_flower_npz["image"]
[21]:
fed_image.partition_shape()
[21]:
{alice: (1835, 240, 240, 3), bob: (1835, 240, 240, 3)}
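Tips#
After converting your data to ndarrays, first test it with a single-machine training engine to check that the data format matches the model. Then test with SecretFlow's federated framework. This order makes troubleshooting more efficient.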
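As one possible form of such a local check, the sketch below validates plain ndarrays before they are saved for each party. check_party_arrays is a hypothetical helper, and the rules it enforces (equal image and label counts within a party, matching dtype and per-sample shape across parties) are assumptions about what a typical image model expects.

```python
import numpy as np


def check_party_arrays(parties):
    """Check {name: (images, labels)} and return a list of problems found."""
    problems = []
    ref_shape = ref_dtype = None
    for name, (images, labels) in parties.items():
        if images.shape[0] != labels.shape[0]:
            problems.append(
                f"{name}: {images.shape[0]} images vs {labels.shape[0]} labels"
            )
        if ref_shape is None:
            # The first party sets the reference per-sample shape and dtype.
            ref_shape, ref_dtype = images.shape[1:], images.dtype
            continue
        if images.shape[1:] != ref_shape:
            problems.append(f"{name}: sample shape {images.shape[1:]} != {ref_shape}")
        if images.dtype != ref_dtype:
            problems.append(f"{name}: dtype {images.dtype} != {ref_dtype}")
    return problems


good = {
    "alice": (np.zeros((4, 8, 8, 3), np.uint8), np.zeros(4, np.int64)),
    "bob": (np.zeros((6, 8, 8, 3), np.uint8), np.zeros(6, np.int64)),
}
bad = {
    "alice": (np.zeros((4, 8, 8, 3), np.uint8), np.zeros(4, np.int64)),
    "bob": (np.zeros((6, 3, 8, 8), np.float32), np.zeros(5, np.int64)),
}

problems_good = check_party_arrays(good)
problems_bad = check_party_arrays(bad)
print(problems_good)
for problem in problems_bad:
    print(problem)
```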
Note: when working with image datasets, pay attention to the dimension order.
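For instance, the flower example stores images channels-last as (N, H, W, C), while PyTorch convolution layers conventionally expect channels-first (N, C, H, W); this is what the commented-out transpose in that cell addresses. A sketch of the conversion on a small synthetic batch:

```python
import numpy as np

# Synthetic batch: 2 images, 4 rows, 5 columns, 3 channels (N, H, W, C).
nhwc = np.arange(2 * 4 * 5 * 3).reshape(2, 4, 5, 3)

# Move the channel axis to position 1: (N, H, W, C) -> (N, C, H, W).
nchw = np.transpose(nhwc, (0, 3, 1, 2))

# transpose returns a view; copy into contiguous memory before saving
# or handing the array to code that requires a contiguous layout.
nchw_contiguous = np.ascontiguousarray(nchw)

print(nhwc.shape, nchw.shape)

# The same pixel is addressed with reordered indices.
same_pixel = bool(nhwc[0, 1, 2, 0] == nchw[0, 0, 1, 2])
print(same_pixel)
```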