{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "9b399a34",
   "metadata": {},
   "source": [
    "# Load Numpy data in SecretFlow"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "11b8eb49",
   "metadata": {},
   "source": [
    "The following codes are demos only. It's **NOT for production** due to system security concerns, please **DO NOT** use it directly in production."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4bbd8384",
   "metadata": {},
   "source": [
    "This tutorial will demonstrate how to load Numpy data in a multi-party secure environment using SecretFlow.  \n",
    "SecretFlow supports multiple formats, including `.npy` and `.npz`, and its interface is designed to be compatible with `numpy` "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8383ac24",
   "metadata": {},
   "source": [
    "## Environment Configuration"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "8ee31441",
   "metadata": {},
   "outputs": [],
   "source": [
    "%load_ext autoreload\n",
    "%autoreload 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "bb86c839",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2023-04-17 16:47:03,538\tINFO worker.py:1538 -- Started a local Ray instance.\n"
     ]
    }
   ],
   "source": [
    "import secretflow as sf\n",
    "\n",
    "# Check the version of your SecretFlow\n",
    "print('The version of SecretFlow: {}'.format(sf.__version__))\n",
    "\n",
    "# In case you have a running secretflow runtime already.\n",
    "sf.shutdown()\n",
    "sf.init(['alice', 'bob', 'charlie'], address=\"local\", log_to_driver=True)\n",
    "alice, bob, charlie = sf.PYU('alice'), sf.PYU('bob'), sf.PYU('charlie')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7e4f9827",
   "metadata": {},
   "source": [
    "## Interface Introduction"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cf743273",
   "metadata": {},
   "source": [
    "In SecretFlow, we provide an interface similar to `numpy.load` called `secretflow.load.ndarray.load` to load `ndarray` data from multiple parties and convert it into a federated representation. \n",
    "\n",
    " Using secretflow.data.load, you can read numpy files from multiple parties and create a `FedNdarray` object.\n",
    "\n",
    "Interface Introduction：[secretflow.data.load](https://www.secretflow.org.cn/docs/secretflow/en/source/secretflow.data.html#secretflow.data.ndarray.load)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "947c30c9",
   "metadata": {},
   "source": [
    "## Data Download and Splitting"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "8b53809f",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "E0417 16:47:08.224327619   13264 fork_posix.cc:76]           Other threads are currently calling into gRPC, skipping fork() handlers\n"
     ]
    }
   ],
   "source": [
    "%%capture\n",
    "%%!\n",
    "wget https://secretflow-data.oss-accelerate.aliyuncs.com/datasets/mnist/mnist.npz\n",
    "pip install opencv-python"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "fb1ca464",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "all_data = np.load(\"./mnist.npz\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "20ccb301",
   "metadata": {},
   "source": [
    "Splitting the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "0ff318ec",
   "metadata": {},
   "outputs": [],
   "source": [
    "alice_train_x = all_data[\"x_train\"][:30000]\n",
    "alice_test_x = all_data[\"x_test\"][:30000]\n",
    "alice_train_y = all_data[\"y_train\"][:30000]\n",
    "alice_test_y = all_data[\"y_test\"][:30000]\n",
    "\n",
    "bob_train_x = all_data[\"x_train\"][30000:]\n",
    "bob_test_x = all_data[\"x_test\"][30000:]\n",
    "bob_train_y = all_data[\"y_train\"][30000:]\n",
    "bob_test_y = all_data[\"y_test\"][30000:]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "41652edb",
   "metadata": {},
   "source": [
    "Saving separately as npz format file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "a8ece9aa",
   "metadata": {},
   "outputs": [],
   "source": [
    "np.savez(\n",
    "    \"./alice_mnist.npz\",\n",
    "    train_x=alice_train_x,\n",
    "    test_x=alice_test_x,\n",
    "    train_y=alice_train_y,\n",
    "    test_y=alice_test_y,\n",
    ")\n",
    "np.savez(\n",
    "    \"./bob_mnist.npz\",\n",
    "    train_x=bob_train_x,\n",
    "    test_x=bob_test_x,\n",
    "    train_y=bob_train_y,\n",
    "    test_y=bob_test_y,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1a4ab72d",
   "metadata": {},
   "source": [
    "Saving tarin_x from Alice and Bob as npy format for convenient future reading."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "0d29231e",
   "metadata": {},
   "outputs": [],
   "source": [
    "np.save(\"./alice_mnist_train_x.npy\", alice_train_x)\n",
    "np.save(\"./bob_mnist_train_x.npy\", bob_train_x)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bd135e73",
   "metadata": {},
   "source": [
    "##  Loading npz files"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "1e8e7578",
   "metadata": {},
   "outputs": [],
   "source": [
    "alice_path = \"./alice_mnist.npz\"\n",
    "bob_path = \"./bob_mnist.npz\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "67fc91ce",
   "metadata": {},
   "outputs": [],
   "source": [
    "from secretflow.data.ndarray import load\n",
    "from secretflow.data.split import train_test_split"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "bcca6eff",
   "metadata": {},
   "outputs": [],
   "source": [
    "fed_npz = load({alice: alice_path, bob: bob_path}, allow_pickle=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "a66aa798",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'train_x': FedNdarray(partitions={alice: <secretflow.device.device.pyu.PYUObject object at 0x7fec825fba90>, bob: <secretflow.device.device.pyu.PYUObject object at 0x7fec825fb580>}, partition_way=<PartitionWay.VERTICAL: 'vertical'>),\n",
       " 'test_x': FedNdarray(partitions={alice: <secretflow.device.device.pyu.PYUObject object at 0x7fec825fbdc0>, bob: <secretflow.device.device.pyu.PYUObject object at 0x7fec82570070>}, partition_way=<PartitionWay.VERTICAL: 'vertical'>),\n",
       " 'train_y': FedNdarray(partitions={alice: <secretflow.device.device.pyu.PYUObject object at 0x7fec82570190>, bob: <secretflow.device.device.pyu.PYUObject object at 0x7fec825704f0>}, partition_way=<PartitionWay.VERTICAL: 'vertical'>),\n",
       " 'test_y': FedNdarray(partitions={alice: <secretflow.device.device.pyu.PYUObject object at 0x7fec82570280>, bob: <secretflow.device.device.pyu.PYUObject object at 0x7fec82570850>}, partition_way=<PartitionWay.VERTICAL: 'vertical'>)}"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fed_npz"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8cde8331",
   "metadata": {},
   "source": [
    "In FedNpz, each value represents a FedNdarray."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "f9e378ed",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "secretflow.data.ndarray.FedNdarray"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(fed_npz[\"train_x\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5e044e91",
   "metadata": {},
   "source": [
    "## Loading npy files"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "166a2c68",
   "metadata": {},
   "source": [
    "Loading npy is very simple. Directly call the load interface, and the results will be a standard FedNdarray object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "c718b1c8",
   "metadata": {},
   "outputs": [],
   "source": [
    "alice_path = \"./alice_mnist_train_x.npy\"\n",
    "bob_path = \"./bob_mnist_train_x.npy\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "1fb4fe11",
   "metadata": {},
   "outputs": [],
   "source": [
    "fed_ndarray = load({alice: alice_path, bob: bob_path}, allow_pickle=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "0f21b86c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "secretflow.data.ndarray.FedNdarray"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(fed_ndarray)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "936ba7ff",
   "metadata": {},
   "source": [
    "##  How can I convert my existing data into a FedNdarray and read it?"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "177fd75c",
   "metadata": {},
   "source": [
    "How can we convert other types of data into FedNdarray data?  \n",
    "If we have an image dataset or a speech dataset, how can we pass the data into a federated model using FedNdarray?  \n",
    "Let's take the flower classification dataset Flower as an example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "e4158ea5",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2023-04-17 11:41:31.422425: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/rh/devtoolset-11/root/usr/lib64:/opt/rh/devtoolset-11/root/usr/lib:/opt/rh/devtoolset-11/root/usr/lib64/dyninst:/opt/rh/devtoolset-11/root/usr/lib/dyninst\n",
      "2023-04-17 11:41:31.456434: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n",
      "2023-04-17 11:41:32.352006: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/rh/devtoolset-11/root/usr/lib64:/opt/rh/devtoolset-11/root/usr/lib:/opt/rh/devtoolset-11/root/usr/lib64/dyninst:/opt/rh/devtoolset-11/root/usr/lib/dyninst\n",
      "2023-04-17 11:41:32.352101: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/rh/devtoolset-11/root/usr/lib64:/opt/rh/devtoolset-11/root/usr/lib:/opt/rh/devtoolset-11/root/usr/lib64/dyninst:/opt/rh/devtoolset-11/root/usr/lib/dyninst\n",
      "2023-04-17 11:41:32.352111: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Downloading data from https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz\n",
      "228813984/228813984 [==============================] - 2s 0us/step\n"
     ]
    }
   ],
   "source": [
    "import tempfile\n",
    "import tensorflow as tf\n",
    "\n",
    "_temp_dir = tempfile.mkdtemp()\n",
    "path_to_flower_dataset = tf.keras.utils.get_file(\n",
    "    \"flower_photos\",\n",
    "    \"https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz\",\n",
    "    untar=True,\n",
    "    cache_dir=_temp_dir,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f2e9c468",
   "metadata": {},
   "source": [
    "After downloading and extracting the dataset, the root directory of the dataset is \"flower_photos\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "e11441a4",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(3670, 240, 240, 3)\n",
      "(3670,)\n",
      "alice images shape = (1835, 240, 240, 3), alice labels shape = (1835,)\n",
      "bob images shape = (1835, 240, 240, 3), bob labels shape = (1835,)\n"
     ]
    }
   ],
   "source": [
    "import os, glob\n",
    "import numpy as np\n",
    "import cv2  # The dependencies need to be installed manually, pip install opencv-python\n",
    "\n",
    "root = path_to_flower_dataset\n",
    "classes = ['daisy', 'dandelion', 'roses', 'sunflowers', 'tulips']\n",
    "img_paths = []  # Used to save all picture paths\n",
    "labels = []  # Used to save the picture category tags,(0,1,2,3,4)\n",
    "for i, label in enumerate(classes):\n",
    "    cls_img_paths = glob.glob(os.path.join(root, label, \"*.jpg\"))\n",
    "    img_paths.extend(cls_img_paths)\n",
    "    labels.extend([i] * len(cls_img_paths))\n",
    "\n",
    "# image->numpy\n",
    "img_numpys = []\n",
    "labels = np.array(labels)\n",
    "for img_path in img_paths:\n",
    "    img_numpy = cv2.imread(img_path)\n",
    "    img_numpy = cv2.resize(img_numpy, (240, 240))\n",
    "    img_numpy = np.reshape(img_numpy, (1, 240, 240, 3))\n",
    "    # If use Pytorch backend dimension should be exchanged\n",
    "    # img_numpy = np.transpose(img_numpy, (0,3,1,2))\n",
    "    img_numpys.append(img_numpy)\n",
    "\n",
    "images = np.concatenate(img_numpys, axis=0)\n",
    "print(images.shape)\n",
    "print(labels.shape)\n",
    "\n",
    "# Distribute images and labels to two nodes, allocating 50% of the data to each node.\n",
    "per = 0.5\n",
    "alice_images = images[: int(per * images.shape[0]), :, :, :]\n",
    "alice_label = labels[: int(per * images.shape[0])]\n",
    "bob_images = images[int(per * images.shape[0]) :, :, :, :]\n",
    "bob_label = labels[int(per * images.shape[0]) :]\n",
    "print(\n",
    "    f\"alice images shape = {alice_images.shape}, alice labels shape = {alice_label.shape}\"\n",
    ")\n",
    "print(f\"bob images shape = {bob_images.shape}, bob labels shape = {bob_label.shape}\")\n",
    "\n",
    "# Save the data as npz files separately, and then send them to the two machines.\n",
    "np.savez(\"flower_alice.npz\", image=alice_images, label=alice_label)\n",
    "np.savez(\"flower_bob.npz\", image=bob_images, label=bob_label)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "942c44d6",
   "metadata": {},
   "source": [
    " Once you have obtained the required NPZ files, use the previously mentioned load function to read them into FedNdarray format. Then, input them into the model to begin training."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "fb97921c",
   "metadata": {},
   "outputs": [],
   "source": [
    "fed_flower_npz = load(\n",
    "    {alice: \"./flower_alice.npz\", bob: \"./flower_bob.npz\"}, allow_pickle=True\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "b851ee52",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'image': FedNdarray(partitions={alice: <secretflow.device.device.pyu.PYUObject object at 0x7feccc781d90>, bob: <secretflow.device.device.pyu.PYUObject object at 0x7febc2cea640>}, partition_way=<PartitionWay.VERTICAL: 'vertical'>),\n",
       " 'label': FedNdarray(partitions={alice: <secretflow.device.device.pyu.PYUObject object at 0x7febc2cea790>, bob: <secretflow.device.device.pyu.PYUObject object at 0x7febc2ceaa30>}, partition_way=<PartitionWay.VERTICAL: 'vertical'>)}"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fed_flower_npz"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "0f76e617",
   "metadata": {},
   "outputs": [],
   "source": [
    "fed_image = fed_flower_npz[\"image\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "368fa687",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{alice: (1835, 240, 240, 3), bob: (1835, 240, 240, 3)}"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fed_image.partition_shape()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d17ec85c",
   "metadata": {},
   "source": [
    "## Tips"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9c2dd4a8",
   "metadata": {},
   "source": [
    "It is recommended to test the data after converting it to the ndarray type using a single-machine training engine to verify if the data format matches the model correctly. Then, you can proceed to test it using the SecretFlow federated framework, which can improve the efficiency of troubleshooting.  \n",
    "*Note: When using image datasets, it is important to pay attention to the dimension ordering.*"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}