{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "0b6b37c9",
   "metadata": {},
   "source": [
    "# Using Custom DataBuilder in SecretFlow (TensorFlow)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e0e57a23",
   "metadata": {},
   "source": [
    "The following codes are demos only. It's **NOT for production** due to system security concerns, please **DO NOT** use it directly in production."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8528a86e",
   "metadata": {},
   "source": [
    "In this tutorial, we will show you how to load data and train model using the custom DataBuilder schema in the multi-party secure environment of SecretFlow.\n",
    "This tutorial will use the image classification task of the Flower dataset to introduce, how to use the custom DataBuilder to complete federated learning.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b03e55b2",
   "metadata": {},
   "source": [
    "## Environment Setting"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "c08ecd6d",
   "metadata": {},
   "outputs": [],
   "source": [
    "%load_ext autoreload\n",
    "%autoreload 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "812c6aea",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2023-04-17 15:18:33,602\tINFO worker.py:1538 -- Started a local Ray instance.\n"
     ]
    }
   ],
   "source": [
    "import secretflow as sf\n",
    "\n",
    "# Check the version of your SecretFlow\n",
    "print('The version of SecretFlow: {}'.format(sf.__version__))\n",
    "\n",
    "# In case you have a running secretflow runtime already.\n",
    "sf.shutdown()\n",
    "sf.init(['alice', 'bob', 'charlie'], address=\"local\", log_to_driver=False)\n",
    "alice, bob, charlie = sf.PYU('alice'), sf.PYU('bob'), sf.PYU('charlie')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "25c38ba4",
   "metadata": {},
   "source": [
    "## Interface Introduction"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "509e5461",
   "metadata": {},
   "source": [
    "We support custom DataBuilder reads in SecretFlow's `FLModel` to make it easier for users to handle data inputs more flexibly according to their needs.\n",
    "Let's use an example to demonstrate how to use the custom DataBuilder for federated model training.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ff157baf",
   "metadata": {},
   "source": [
    "Steps to use DataBuilder:\n",
    "\n",
    "1. Use the single-machine version engine (TensorFlow, PyTorch) to develop and get the Builder function of the Dataset.\n",
    "2. Wrap the Builder functions of each party to get `create_dataset_builder` function. *Note: The dataset_builder needs to pass in the stage parameter.*\n",
    "3. Build the data_builder_dict [PYU, dataset_builder].\n",
    "4. Pass the obtained data_builder_dict to the `dataset_builder` of the `fit` function. At the same time, the x parameter position is passed into the required input in dataset_builder (eg: the input passed in this example is the actual image path used)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fad75fd6",
   "metadata": {},
   "source": [
    "Using DataBuilder in FLModel requires a pre-defined `data_builder_dict`. Need to be able to return `tf.dataset` and `steps_per_epoch`. And the steps_per_epoch returned by all parties must be consistent.\n",
    "```python\n",
    "data_builder_dict = \n",
    "        {\n",
    "            alice: create_alice_dataset_builder(\n",
    "                batch_size=32,\n",
    "            ), # create_alice_dataset_builder must return (Dataset, steps_per_epoch)\n",
    "            bob: create_bob_dataset_builder(\n",
    "                batch_size=32,\n",
    "            ), # create_bob_dataset_builder must return (Dataset, steps_per_epochstep_per_epochs)\n",
    "        }\n",
    "\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c55b0faf",
   "metadata": {},
   "source": [
    "## Download Data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "07ffc09e",
   "metadata": {},
   "source": [
    "Flower Dataset Introduction: The Flower dataset consists of 4323 color images of 5 different types of flowers (daisy, dandelion, rose, sunflower, and tulip). Each flower has images from multiple angles and different lighting conditions, and the resolution of each image is 320x240.\n",
    "This dataset is commonly used for training and testing of image classification and machine learning algorithms. The number of each category in the dataset is as follows: daisy (633), dandelion (898), rose (641), sunflower (699), and tulip (852).\n",
    "\n",
    "Download link: [http://download.tensorflow.org/example_images/flower_photos.tgz](http://download.tensorflow.org/example_images/flower_photos.tgz)\n",
    "\n",
    "<img alt=\"flower_dataset_demo.png\" src=\"resources/flower_dataset_demo.png\" width=\"600\">\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ef1f5d14",
   "metadata": {},
   "source": [
    "### Download Data and Unzip"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "ff9720cd",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Downloading data from https://secretflow-data.oss-accelerate.aliyuncs.com/datasets/tf_flowers/flower_photos.tgz\n",
      "67588319/67588319 [==============================] - 1s 0us/step\n"
     ]
    }
   ],
   "source": [
    "import tempfile\n",
    "import tensorflow as tf\n",
    "\n",
    "_temp_dir = tempfile.mkdtemp()\n",
    "path_to_flower_dataset = tf.keras.utils.get_file(\n",
    "    \"flower_photos\",\n",
    "    \"https://secretflow-data.oss-accelerate.aliyuncs.com/datasets/tf_flowers/flower_photos.tgz\",\n",
    "    untar=True,\n",
    "    cache_dir=_temp_dir,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9d6f1057",
   "metadata": {},
   "source": [
    "Next let's start building a custom DataBuilder"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "83d98ab3",
   "metadata": {},
   "source": [
    "## 1. Develop DataBuilder with single-machine version engine"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5c81834d",
   "metadata": {},
   "source": [
    "When we develop DataBuilder, we are free to follow the logic of single-machine development.\n",
    "The purpose is to build a `tf.dataset` object.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "dfe48fe7",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Found 436 files belonging to 5 classes.\n",
      "Using 349 files for training.\n",
      "Using 87 files for validation.\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2023-04-10 13:16:34.492390: E tensorflow/stream_executor/cuda/cuda_driver.cc:265] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected\n"
     ]
    }
   ],
   "source": [
    "import math\n",
    "import tensorflow as tf\n",
    "\n",
    "img_height = 180\n",
    "img_width = 180\n",
    "batch_size = 32\n",
    "# In this example, we use the TensorFlow interface for development.\n",
    "data_set = tf.keras.utils.image_dataset_from_directory(\n",
    "    path_to_flower_dataset,\n",
    "    validation_split=0.2,\n",
    "    subset=\"both\",\n",
    "    seed=123,\n",
    "    image_size=(img_height, img_width),\n",
    "    batch_size=batch_size,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "8106f107",
   "metadata": {},
   "outputs": [],
   "source": [
    "train_set = data_set[0]\n",
    "test_set = data_set[1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "37dd53fa",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'tensorflow.python.data.ops.dataset_ops.BatchDataset'> <class 'tensorflow.python.data.ops.dataset_ops.BatchDataset'>\n"
     ]
    }
   ],
   "source": [
    "print(type(train_set), type(test_set))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "f4199b24",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "x.shape = (32, 180, 180, 3)\n",
      "y.shape = (32,)\n"
     ]
    }
   ],
   "source": [
    "x, y = next(iter(train_set))\n",
    "print(f\"x.shape = {x.shape}\")\n",
    "print(f\"y.shape = {y.shape}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "020e0c15",
   "metadata": {},
   "source": [
    "## 2. Wrap the developed DataBuilder"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bfbb0e6d",
   "metadata": {},
   "source": [
    "The DataBuilder we developed needs to be distributed to each execution machine for execution, and we need to wrap them in order to serialize.\n",
    "Note: **FLModel requires the incoming DataBuilder return two results (data_set, steps_per_epoch).**\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "d36bee75",
   "metadata": {},
   "outputs": [],
   "source": [
    "def create_dataset_builder(\n",
    "    batch_size=32,\n",
    "):\n",
    "    def dataset_builder(folder_path, stage=\"train\"):\n",
    "        import math\n",
    "\n",
    "        import tensorflow as tf\n",
    "\n",
    "        img_height = 180\n",
    "        img_width = 180\n",
    "        data_set = tf.keras.utils.image_dataset_from_directory(\n",
    "            folder_path,\n",
    "            validation_split=0.2,\n",
    "            subset=\"both\",\n",
    "            seed=123,\n",
    "            image_size=(img_height, img_width),\n",
    "            batch_size=batch_size,\n",
    "        )\n",
    "        if stage == \"train\":\n",
    "            train_dataset = data_set[0]\n",
    "            train_step_per_epoch = math.ceil(len(data_set[0].file_paths) / batch_size)\n",
    "            return train_dataset, train_step_per_epoch\n",
    "        elif stage == \"eval\":\n",
    "            eval_dataset = data_set[1]\n",
    "            eval_step_per_epoch = math.ceil(len(data_set[1].file_paths) / batch_size)\n",
    "            return eval_dataset, eval_step_per_epoch\n",
    "\n",
    "    return dataset_builder"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "198128c5",
   "metadata": {},
   "source": [
    "## 3. Build dataset_builder_dict"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "56a73a04",
   "metadata": {},
   "source": [
    "In the horizontal scenario, the logic for all parties to process data is the same, so we only need a wrapped DataBuilder construction method.\n",
    "Next we build the `dataset_builder_dict`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "c051b566",
   "metadata": {},
   "outputs": [],
   "source": [
    "data_builder_dict = {\n",
    "    alice: create_dataset_builder(\n",
    "        batch_size=32,\n",
    "    ),\n",
    "    bob: create_dataset_builder(\n",
    "        batch_size=32,\n",
    "    ),\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e8e36c7d",
   "metadata": {},
   "source": [
    "## 4. After get dataset_builder_dict, we can pass it into the model for use"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5602707d",
   "metadata": {},
   "source": [
    "Next we define the model and use the custom data constructed above for training"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "feea334e",
   "metadata": {},
   "outputs": [],
   "source": [
    "def create_conv_flower_model(input_shape, num_classes, name='model'):\n",
    "    def create_model():\n",
    "        from tensorflow import keras\n",
    "\n",
    "        # Create model\n",
    "\n",
    "        model = keras.Sequential(\n",
    "            [\n",
    "                keras.Input(shape=input_shape),\n",
    "                tf.keras.layers.Rescaling(1.0 / 255),\n",
    "                tf.keras.layers.Conv2D(32, 3, activation='relu'),\n",
    "                tf.keras.layers.MaxPooling2D(),\n",
    "                tf.keras.layers.Conv2D(32, 3, activation='relu'),\n",
    "                tf.keras.layers.MaxPooling2D(),\n",
    "                tf.keras.layers.Conv2D(32, 3, activation='relu'),\n",
    "                tf.keras.layers.MaxPooling2D(),\n",
    "                tf.keras.layers.Flatten(),\n",
    "                tf.keras.layers.Dense(128, activation='relu'),\n",
    "                tf.keras.layers.Dense(num_classes),\n",
    "            ]\n",
    "        )\n",
    "        # Compile model\n",
    "        model.compile(\n",
    "            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),\n",
    "            optimizer='adam',\n",
    "            metrics=[\"accuracy\"],\n",
    "        )\n",
    "        return model\n",
    "\n",
    "    return create_model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "60414cb0",
   "metadata": {},
   "outputs": [],
   "source": [
    "from secretflow.ml.nn import FLModel\n",
    "from secretflow.security.aggregation import SecureAggregator"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "f368538f",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "INFO:root:Create proxy actor <class 'secretflow.security.aggregation.secure_aggregator._Masker'> with party alice.\n",
      "INFO:root:Create proxy actor <class 'secretflow.security.aggregation.secure_aggregator._Masker'> with party bob.\n",
      "INFO:root:Create proxy actor <class 'secretflow.ml.nn.fl.backend.tensorflow.strategy.fed_avg_w.PYUFedAvgW'> with party alice.\n",
      "INFO:root:Create proxy actor <class 'secretflow.ml.nn.fl.backend.tensorflow.strategy.fed_avg_w.PYUFedAvgW'> with party bob.\n"
     ]
    }
   ],
   "source": [
    "device_list = [alice, bob]\n",
    "aggregator = SecureAggregator(charlie, [alice, bob])\n",
    "\n",
    "# prepare model\n",
    "num_classes = 5\n",
    "input_shape = (180, 180, 3)\n",
    "\n",
    "# keras model\n",
    "model = create_conv_flower_model(input_shape, num_classes)\n",
    "\n",
    "\n",
    "fed_model = FLModel(\n",
    "    device_list=device_list,\n",
    "    model=model,\n",
    "    aggregator=aggregator,\n",
    "    backend=\"tensorflow\",\n",
    "    strategy=\"fed_avg_w\",\n",
    "    random_seed=1234,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "32d97af9",
   "metadata": {},
   "source": [
    "The input of our constructed dataset builder is the path of the image dataset, so we need to set the input data as a `Dict` here.\n",
    "```python\n",
    "data = {\n",
    "    alice: folder_path_of_alice,\n",
    "    bob: folder_path_of_bob\n",
    "}\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "de4b659a",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "INFO:root:FL Train Params: {'self': <secretflow.ml.nn.fl.fl_model.FLModel object at 0x7f7b7a28b8e0>, 'x': {alice: '../../public_dataset/datasets/flower_photos', bob: '../../public_dataset/datasets/flower_photos'}, 'y': None, 'batch_size': 32, 'batch_sampling_rate': None, 'epochs': 5, 'verbose': 1, 'callbacks': None, 'validation_data': {alice: '../../public_dataset/datasets/flower_photos', bob: '../../public_dataset/datasets/flower_photos'}, 'shuffle': False, 'class_weight': None, 'sample_weight': None, 'validation_freq': 1, 'aggregate_freq': 2, 'label_decoder': None, 'max_batch_size': 20000, 'prefetch_buffer_size': None, 'sampler_method': 'batch', 'random_seed': 1234, 'dp_spent_step_freq': 1, 'audit_log_dir': None, 'dataset_builder': {alice: <function create_dataset_builder.<locals>.dataset_builder at 0x7f7b7a2bb1f0>, bob: <function create_dataset_builder.<locals>.dataset_builder at 0x7f7b7a2bb0d0>}}\n",
      "32it [00:18,  1.71it/s, epoch: 1/5 -  loss:1.5339548587799072  accuracy:0.3142559826374054  val_loss:1.582740068435669  val_accuracy:0.2874999940395355 ]\n",
      "100%|██████████| 8/8 [00:05<00:00,  1.51it/s, epoch: 2/5 -  loss:1.4520319700241089  accuracy:0.36693549156188965  val_loss:1.3319271802902222  val_accuracy:0.40416666865348816 ]\n",
      "100%|██████████| 8/8 [00:05<00:00,  1.54it/s, epoch: 3/5 -  loss:1.2720597982406616  accuracy:0.45766130089759827  val_loss:1.3382091522216797  val_accuracy:0.47083333134651184 ]\n",
      "100%|██████████| 8/8 [00:05<00:00,  1.50it/s, epoch: 4/5 -  loss:1.229131817817688  accuracy:0.5040322542190552  val_loss:1.3033963441848755  val_accuracy:0.4375 ]\n",
      "100%|██████████| 8/8 [00:05<00:00,  1.59it/s, epoch: 5/5 -  loss:1.3306885957717896  accuracy:0.4301075339317322  val_loss:2.1492652893066406  val_accuracy:0.25833332538604736 ]\n"
     ]
    }
   ],
   "source": [
    "data = {\n",
    "    alice: path_to_flower_dataset,\n",
    "    bob: path_to_flower_dataset,\n",
    "}\n",
    "history = fed_model.fit(\n",
    "    data,\n",
    "    None,\n",
    "    validation_data=data,\n",
    "    epochs=5,\n",
    "    batch_size=32,\n",
    "    aggregate_freq=2,\n",
    "    sampler_method=\"batch\",\n",
    "    random_seed=1234,\n",
    "    dp_spent_step_freq=1,\n",
    "    dataset_builder=data_builder_dict,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "54ff96a9",
   "metadata": {},
   "source": [
    "Next, you can use your own dataset to try"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}