{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true,
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "# Customize DataBuilder on SLModel in SecretFlow"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following codes are demos only. It's **NOT for production** due to system security concerns, please **DO NOT** use it directly in production. "
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "We support federated learning in vertical scenarios on SLModel in SecretFlow. In this tutorial, we will take DeepFM model for recommendation as an example to introduce how to customize a DataBuilder on SLModel. \n",
    "\n",
    "The main goal of this tutorial is to train DeepFM to show how to customize a DataBuilder on SLModel. \n",
    "\n",
    "*If you want to learn more about related content for DeepFM model and split plans, please move to related documents for DeepFM model.* "
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Environment Setting"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "%load_ext autoreload\n",
    "%autoreload 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Since the GPL-licensed package `unidecode` is not installed, using Python's `unicodedata` package which yields worse results.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The version of SecretFlow: 0.8.3b0\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2023-07-03 20:12:02,535\tINFO worker.py:1538 -- Started a local Ray instance.\n"
     ]
    }
   ],
   "source": [
    "import secretflow as sf\n",
    "\n",
    "# Check the version of your SecretFlow\n",
    "print('The version of SecretFlow: {}'.format(sf.__version__))\n",
    "\n",
    "# In case you have a running secretflow runtime already.\n",
    "sf.shutdown()\n",
    "sf.init(['alice', 'bob', 'charlie'], address=\"local\", log_to_driver=False)\n",
    "alice, bob, charlie = sf.PYU('alice'), sf.PYU('bob'), sf.PYU('charlie')"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Datasets"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "We will use the classic dataset [MovieLens](https://secretflow-data.oss-accelerate.aliyuncs.com/datasets/movielens/ml-1m.zip) for introduction. MovieLens is an open recommendation system dataset that contains movie ratings and movie metadata information. \n",
    "\n",
    "And we split data into several pieces: \n",
    "\n",
    "- alice: \"UserID\", \"Gender\", \"Age\", \"Occupation\", \"Zip-code\" \n",
    "- bob: \"MovieID\", \"Rating\", \"Title\", \"Genres\", \"Timestamp\" \n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "source": [
    "## Define DataBuilders"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We customize a DataBuilder with the aim to meet higher level customization demands, as the existing FedDataFrame and FedNdarray did not provide the required functionality. \n",
    "\n",
    "In Split Learning(where all sides of the data need to be aligned), we only provide DataBuilder for CSV data for the time being. You can define what to do with each row of data in the custom DataBuilder. The operations in SLModel depend on the way you define it. \n",
    "\n",
    "The things you should notice here: \n",
    "\n",
    "- We read MovieLens dataset in CSV format. However, it needs to be treated as sparse features. The DataBuilder will convert each column into the form of dictionary. \n",
    "- Bob's score was processed into binary form using threshold. \n",
    "- Custom functions can be defined in the DataBuilder and then applied to each column using dataset.map. \n",
    "- Only CSV format is supported in the vertical scenario due to the restriction that both sides of the data should be aligned. \n",
    "- It returns Dataset which has been defined Batch_size and Repeat in the CSV mode so that SLModel can infer the Steps_per_epoch according to the Dataset. \n",
    "\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1. Download the dataset and convert it to the CSV format."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture\n",
    "%%!\n",
    "wget https://secretflow-data.oss-accelerate.aliyuncs.com/datasets/movielens/ml-1m.zip\n",
    "unzip ./ml-1m.zip "
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Read the data from the dat format and convert it to the dictionary form. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def load_data(filename, columns):\n",
    "    data = {}\n",
    "    with open(filename, \"r\", encoding=\"unicode_escape\") as f:\n",
    "        for line in f:\n",
    "            ls = line.strip(\"\\n\").split(\"::\")\n",
    "            data[ls[0]] = dict(zip(columns[1:], ls[1:]))\n",
    "    return data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "users_data = load_data(\n",
    "    \"./ml-1m/users.dat\",\n",
    "    columns=[\"UserID\", \"Gender\", \"Age\", \"Occupation\", \"Zip-code\"],\n",
    ")\n",
    "movies_data = load_data(\"./ml-1m/movies.dat\", columns=[\"MovieID\", \"Title\", \"Genres\"])\n",
    "ratings_columns = [\"UserID\", \"MovieID\", \"Rating\", \"Timestamp\"]\n",
    "\n",
    "rating_data = load_data(\"./ml-1m/ratings.dat\", columns=ratings_columns)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "6040\n",
      "3883\n",
      "6040\n"
     ]
    }
   ],
   "source": [
    "print(len(users_data))\n",
    "print(len(movies_data))\n",
    "print(len(rating_data))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we join user, movie and rating, split up and assemble them into 'alice_ml1m.csv' and 'bob_ml1m.csv'. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "fed_csv = {alice: \"alice_ml1m.csv\", bob: \"bob_ml1m.csv\"}\n",
    "csv_writer_container = {alice: open(fed_csv[alice], \"w\"), bob: open(fed_csv[bob], \"w\")}\n",
    "part_columns = {\n",
    "    alice: [\"UserID\", \"Gender\", \"Age\", \"Occupation\", \"Zip-code\"],\n",
    "    bob: [\"MovieID\", \"Rating\", \"Title\", \"Genres\", \"Timestamp\"],\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "for device, writer in csv_writer_container.items():\n",
    "    writer.write(\"ID,\" + \",\".join(part_columns[device]) + \"\\n\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "f = open(\"ml-1m/ratings.dat\", \"r\", encoding=\"unicode_escape\")\n",
    "\n",
    "\n",
    "def _parse_example(feature, columns, index):\n",
    "    if \"Title\" in feature.keys():\n",
    "        feature[\"Title\"] = feature[\"Title\"].replace(\",\", \"_\")\n",
    "    if \"Genres\" in feature.keys():\n",
    "        feature[\"Genres\"] = feature[\"Genres\"].replace(\"|\", \" \")\n",
    "    values = []\n",
    "    values.append(str(index))\n",
    "    for c in columns:\n",
    "        values.append(feature[c])\n",
    "    return \",\".join(values)\n",
    "\n",
    "\n",
    "index = 0\n",
    "num_sample = 1000\n",
    "for line in f:\n",
    "    ls = line.strip().split(\"::\")\n",
    "    rating = dict(zip(ratings_columns, ls))\n",
    "    rating.update(users_data.get(ls[0]))\n",
    "    rating.update(movies_data.get(ls[1]))\n",
    "    for device, columns in part_columns.items():\n",
    "        parse_f = _parse_example(rating, columns, index)\n",
    "        csv_writer_container[device].write(parse_f + \"\\n\")\n",
    "    index += 1\n",
    "    if num_sample > 0 and index >= num_sample:\n",
    "        break\n",
    "for w in csv_writer_container.values():\n",
    "    w.close()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## So we've done the data processing and splitting up by now. \n",
    "\n",
    "And we output two files:  \n",
    "```\n",
    "Alice: alice_ml1m.csv and Bob: bob_ml1m.csv \n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ID,UserID,Gender,Age,Occupation,Zip-code\r\n",
      "0,1,F,1,10,48067\r\n",
      "1,1,F,1,10,48067\r\n",
      "2,1,F,1,10,48067\r\n",
      "3,1,F,1,10,48067\r\n",
      "4,1,F,1,10,48067\r\n",
      "5,1,F,1,10,48067\r\n",
      "6,1,F,1,10,48067\r\n",
      "7,1,F,1,10,48067\r\n",
      "8,1,F,1,10,48067\r\n"
     ]
    }
   ],
   "source": [
    "! head alice_ml1m.csv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ID,MovieID,Rating,Title,Genres,Timestamp\r\n",
      "0,1193,5,One Flew Over the Cuckoo's Nest (1975),Drama,978300760\r\n",
      "1,661,3,James and the Giant Peach (1996),Animation Children's Musical,978302109\r\n",
      "2,914,3,My Fair Lady (1964),Musical Romance,978301968\r\n",
      "3,3408,4,Erin Brockovich (2000),Drama,978300275\r\n",
      "4,2355,5,Bug's Life_ A (1998),Animation Children's Comedy,978824291\r\n",
      "5,1197,3,Princess Bride_ The (1987),Action Adventure Comedy Romance,978302268\r\n",
      "6,1287,5,Ben-Hur (1959),Action Adventure Drama,978302039\r\n",
      "7,2804,5,Christmas Story_ A (1983),Comedy Drama,978300719\r\n",
      "8,594,4,Snow White and the Seven Dwarfs (1937),Animation Children's Musical,978302268\r\n"
     ]
    }
   ],
   "source": [
    "! head bob_ml1m.csv"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1. Use plaintext engines to develop Databuilders. \n",
    "\n",
    "Because the data for each side of the SLModel is different, DataBuilder needs to be developed separately. "
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Develop Alice's DataBuilder"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "alice_df = pd.read_csv(\"alice_ml1m.csv\", encoding=\"utf-8\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>UserID</th>\n",
       "      <th>Gender</th>\n",
       "      <th>Age</th>\n",
       "      <th>Occupation</th>\n",
       "      <th>Zip-code</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>F</td>\n",
       "      <td>1</td>\n",
       "      <td>10</td>\n",
       "      <td>48067</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>F</td>\n",
       "      <td>1</td>\n",
       "      <td>10</td>\n",
       "      <td>48067</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>F</td>\n",
       "      <td>1</td>\n",
       "      <td>10</td>\n",
       "      <td>48067</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>F</td>\n",
       "      <td>1</td>\n",
       "      <td>10</td>\n",
       "      <td>48067</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1</td>\n",
       "      <td>F</td>\n",
       "      <td>1</td>\n",
       "      <td>10</td>\n",
       "      <td>48067</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>995</th>\n",
       "      <td>10</td>\n",
       "      <td>F</td>\n",
       "      <td>35</td>\n",
       "      <td>1</td>\n",
       "      <td>95370</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>996</th>\n",
       "      <td>10</td>\n",
       "      <td>F</td>\n",
       "      <td>35</td>\n",
       "      <td>1</td>\n",
       "      <td>95370</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>997</th>\n",
       "      <td>10</td>\n",
       "      <td>F</td>\n",
       "      <td>35</td>\n",
       "      <td>1</td>\n",
       "      <td>95370</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>998</th>\n",
       "      <td>10</td>\n",
       "      <td>F</td>\n",
       "      <td>35</td>\n",
       "      <td>1</td>\n",
       "      <td>95370</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>999</th>\n",
       "      <td>10</td>\n",
       "      <td>F</td>\n",
       "      <td>35</td>\n",
       "      <td>1</td>\n",
       "      <td>95370</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>1000 rows × 5 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "    UserID Gender  Age  Occupation  Zip-code\n",
       "0        1      F    1          10     48067\n",
       "1        1      F    1          10     48067\n",
       "2        1      F    1          10     48067\n",
       "3        1      F    1          10     48067\n",
       "4        1      F    1          10     48067\n",
       "..     ...    ...  ...         ...       ...\n",
       "995     10      F   35           1     95370\n",
       "996     10      F   35           1     95370\n",
       "997     10      F   35           1     95370\n",
       "998     10      F   35           1     95370\n",
       "999     10      F   35           1     95370\n",
       "\n",
       "[1000 rows x 5 columns]"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "alice_df[\"UserID\"] = alice_df[\"UserID\"].astype(\"string\")\n",
    "alice_df = alice_df.drop(columns=\"ID\")\n",
    "alice_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2023-07-03 20:23:17.128798: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory\n",
      "2023-07-03 20:23:19.575681: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory\n",
      "2023-07-03 20:23:19.576067: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory\n",
      "2023-07-03 20:23:19.576096: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.\n",
      "2023-07-03 20:23:22.559302: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory\n",
      "2023-07-03 20:23:22.560651: W tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:265] failed call to cuInit: UNKNOWN ERROR (303)\n"
     ]
    }
   ],
   "source": [
    "import tensorflow as tf\n",
    "\n",
    "alice_dict = dict(alice_df)\n",
    "data_set = tf.data.Dataset.from_tensor_slices(alice_dict).batch(32).repeat(1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<RepeatDataset element_spec={'UserID': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'Gender': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'Age': TensorSpec(shape=(None,), dtype=tf.int64, name=None), 'Occupation': TensorSpec(shape=(None,), dtype=tf.int64, name=None), 'Zip-code': TensorSpec(shape=(None,), dtype=tf.int64, name=None)}>"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data_set"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Develop Bob's DataBuilder"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>MovieID</th>\n",
       "      <th>Rating</th>\n",
       "      <th>Title</th>\n",
       "      <th>Genres</th>\n",
       "      <th>Timestamp</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1193</td>\n",
       "      <td>5</td>\n",
       "      <td>One Flew Over the Cuckoo's Nest (1975)</td>\n",
       "      <td>Drama</td>\n",
       "      <td>978300760</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>661</td>\n",
       "      <td>3</td>\n",
       "      <td>James and the Giant Peach (1996)</td>\n",
       "      <td>Animation Children's Musical</td>\n",
       "      <td>978302109</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>914</td>\n",
       "      <td>3</td>\n",
       "      <td>My Fair Lady (1964)</td>\n",
       "      <td>Musical Romance</td>\n",
       "      <td>978301968</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3408</td>\n",
       "      <td>4</td>\n",
       "      <td>Erin Brockovich (2000)</td>\n",
       "      <td>Drama</td>\n",
       "      <td>978300275</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2355</td>\n",
       "      <td>5</td>\n",
       "      <td>Bug's Life_ A (1998)</td>\n",
       "      <td>Animation Children's Comedy</td>\n",
       "      <td>978824291</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>995</th>\n",
       "      <td>3704</td>\n",
       "      <td>2</td>\n",
       "      <td>Mad Max Beyond Thunderdome (1985)</td>\n",
       "      <td>Action Sci-Fi</td>\n",
       "      <td>978228364</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>996</th>\n",
       "      <td>1020</td>\n",
       "      <td>3</td>\n",
       "      <td>Cool Runnings (1993)</td>\n",
       "      <td>Comedy</td>\n",
       "      <td>978228726</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>997</th>\n",
       "      <td>784</td>\n",
       "      <td>3</td>\n",
       "      <td>Cable Guy_ The (1996)</td>\n",
       "      <td>Comedy</td>\n",
       "      <td>978230946</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>998</th>\n",
       "      <td>858</td>\n",
       "      <td>3</td>\n",
       "      <td>Godfather_ The (1972)</td>\n",
       "      <td>Action Crime Drama</td>\n",
       "      <td>978224375</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>999</th>\n",
       "      <td>1022</td>\n",
       "      <td>5</td>\n",
       "      <td>Cinderella (1950)</td>\n",
       "      <td>Animation Children's Musical</td>\n",
       "      <td>979775689</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>1000 rows × 5 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "    MovieID  Rating                                   Title  \\\n",
       "0      1193       5  One Flew Over the Cuckoo's Nest (1975)   \n",
       "1       661       3        James and the Giant Peach (1996)   \n",
       "2       914       3                     My Fair Lady (1964)   \n",
       "3      3408       4                  Erin Brockovich (2000)   \n",
       "4      2355       5                    Bug's Life_ A (1998)   \n",
       "..      ...     ...                                     ...   \n",
       "995    3704       2       Mad Max Beyond Thunderdome (1985)   \n",
       "996    1020       3                    Cool Runnings (1993)   \n",
       "997     784       3                   Cable Guy_ The (1996)   \n",
       "998     858       3                   Godfather_ The (1972)   \n",
       "999    1022       5                       Cinderella (1950)   \n",
       "\n",
       "                           Genres  Timestamp  \n",
       "0                           Drama  978300760  \n",
       "1    Animation Children's Musical  978302109  \n",
       "2                 Musical Romance  978301968  \n",
       "3                           Drama  978300275  \n",
       "4     Animation Children's Comedy  978824291  \n",
       "..                            ...        ...  \n",
       "995                 Action Sci-Fi  978228364  \n",
       "996                        Comedy  978228726  \n",
       "997                        Comedy  978230946  \n",
       "998            Action Crime Drama  978224375  \n",
       "999  Animation Children's Musical  979775689  \n",
       "\n",
       "[1000 rows x 5 columns]"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bob_df = pd.read_csv(\"bob_ml1m.csv\", encoding=\"utf-8\")\n",
    "bob_df = bob_df.drop(columns=\"ID\")\n",
    "\n",
    "bob_df[\"MovieID\"] = bob_df[\"MovieID\"].astype(\"string\")\n",
    "\n",
    "bob_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "label = bob_df[\"Rating\"]\n",
    "data = bob_df.drop(columns=\"Rating\")\n",
    "\n",
    "\n",
    "def _parse_bob(row_sample, label):\n",
    "    import tensorflow as tf\n",
    "\n",
    "    y_t = label\n",
    "    y = tf.expand_dims(\n",
    "        tf.where(\n",
    "            y_t > 3,\n",
    "            tf.ones_like(y_t, dtype=tf.float32),\n",
    "            tf.zeros_like(y_t, dtype=tf.float32),\n",
    "        ),\n",
    "        axis=1,\n",
    "    )\n",
    "    return row_sample, y"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "bob_dict = tuple([dict(data), label])\n",
    "data_set = tf.data.Dataset.from_tensor_slices(bob_dict).batch(32).repeat(1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_set = data_set.map(_parse_bob)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "({'MovieID': <tf.Tensor: shape=(32,), dtype=string, numpy=\n",
       "  array([b'1193', b'661', b'914', b'3408', b'2355', b'1197', b'1287',\n",
       "         b'2804', b'594', b'919', b'595', b'938', b'2398', b'2918', b'1035',\n",
       "         b'2791', b'2687', b'2018', b'3105', b'2797', b'2321', b'720',\n",
       "         b'1270', b'527', b'2340', b'48', b'1097', b'1721', b'1545', b'745',\n",
       "         b'2294', b'3186'], dtype=object)>,\n",
       "  'Title': <tf.Tensor: shape=(32,), dtype=string, numpy=\n",
       "  array([b\"One Flew Over the Cuckoo's Nest (1975)\",\n",
       "         b'James and the Giant Peach (1996)', b'My Fair Lady (1964)',\n",
       "         b'Erin Brockovich (2000)', b\"Bug's Life_ A (1998)\",\n",
       "         b'Princess Bride_ The (1987)', b'Ben-Hur (1959)',\n",
       "         b'Christmas Story_ A (1983)',\n",
       "         b'Snow White and the Seven Dwarfs (1937)',\n",
       "         b'Wizard of Oz_ The (1939)', b'Beauty and the Beast (1991)',\n",
       "         b'Gigi (1958)', b'Miracle on 34th Street (1947)',\n",
       "         b\"Ferris Bueller's Day Off (1986)\", b'Sound of Music_ The (1965)',\n",
       "         b'Airplane! (1980)', b'Tarzan (1999)', b'Bambi (1942)',\n",
       "         b'Awakenings (1990)', b'Big (1988)', b'Pleasantville (1998)',\n",
       "         b'Wallace & Gromit: The Best of Aardman Animation (1996)',\n",
       "         b'Back to the Future (1985)', b\"Schindler's List (1993)\",\n",
       "         b'Meet Joe Black (1998)', b'Pocahontas (1995)',\n",
       "         b'E.T. the Extra-Terrestrial (1982)', b'Titanic (1997)',\n",
       "         b'Ponette (1996)', b'Close Shave_ A (1995)', b'Antz (1998)',\n",
       "         b'Girl_ Interrupted (1999)'], dtype=object)>,\n",
       "  'Genres': <tf.Tensor: shape=(32,), dtype=string, numpy=\n",
       "  array([b'Drama', b\"Animation Children's Musical\", b'Musical Romance',\n",
       "         b'Drama', b\"Animation Children's Comedy\",\n",
       "         b'Action Adventure Comedy Romance', b'Action Adventure Drama',\n",
       "         b'Comedy Drama', b\"Animation Children's Musical\",\n",
       "         b\"Adventure Children's Drama Musical\",\n",
       "         b\"Animation Children's Musical\", b'Musical', b'Drama', b'Comedy',\n",
       "         b'Musical', b'Comedy', b\"Animation Children's\",\n",
       "         b\"Animation Children's\", b'Drama', b'Comedy Fantasy', b'Comedy',\n",
       "         b'Animation', b'Comedy Sci-Fi', b'Drama War', b'Romance',\n",
       "         b\"Animation Children's Musical Romance\",\n",
       "         b\"Children's Drama Fantasy Sci-Fi\", b'Drama Romance', b'Drama',\n",
       "         b'Animation Comedy Thriller', b\"Animation Children's\", b'Drama'],\n",
       "        dtype=object)>,\n",
       "  'Timestamp': <tf.Tensor: shape=(32,), dtype=int64, numpy=\n",
       "  array([978300760, 978302109, 978301968, 978300275, 978824291, 978302268,\n",
       "         978302039, 978300719, 978302268, 978301368, 978824268, 978301752,\n",
       "         978302281, 978302124, 978301753, 978302188, 978824268, 978301777,\n",
       "         978301713, 978302039, 978302205, 978300760, 978300055, 978824195,\n",
       "         978300103, 978824351, 978301953, 978300055, 978824139, 978824268,\n",
       "         978824291, 978300019])>},\n",
       " <tf.Tensor: shape=(32, 1), dtype=float32, numpy=\n",
       " array([[1.],\n",
       "        [0.],\n",
       "        [0.],\n",
       "        [1.],\n",
       "        [1.],\n",
       "        [0.],\n",
       "        [1.],\n",
       "        [1.],\n",
       "        [1.],\n",
       "        [1.],\n",
       "        [1.],\n",
       "        [1.],\n",
       "        [1.],\n",
       "        [1.],\n",
       "        [1.],\n",
       "        [1.],\n",
       "        [0.],\n",
       "        [1.],\n",
       "        [1.],\n",
       "        [1.],\n",
       "        [0.],\n",
       "        [0.],\n",
       "        [1.],\n",
       "        [1.],\n",
       "        [0.],\n",
       "        [1.],\n",
       "        [1.],\n",
       "        [1.],\n",
       "        [1.],\n",
       "        [0.],\n",
       "        [1.],\n",
       "        [1.]], dtype=float32)>)"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "next(iter(data_set))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. Wrap their DataBuilders separately."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [],
   "source": [
    "# alice\n",
    "def create_dataset_builder_alice(\n",
    "    batch_size=128,\n",
    "    repeat_count=5,\n",
    "):\n",
    "    def dataset_builder(x):\n",
    "        import pandas as pd\n",
    "        import tensorflow as tf\n",
    "\n",
    "        x = [dict(t) if isinstance(t, pd.DataFrame) else t for t in x]\n",
    "        x = x[0] if len(x) == 1 else tuple(x)\n",
    "        data_set = (\n",
    "            tf.data.Dataset.from_tensor_slices(x).batch(batch_size).repeat(repeat_count)\n",
    "        )\n",
    "\n",
    "        return data_set\n",
    "\n",
    "    return dataset_builder\n",
    "\n",
    "\n",
    "# bob\n",
    "def create_dataset_builder_bob(\n",
    "    batch_size=128,\n",
    "    repeat_count=5,\n",
    "):\n",
    "    def _parse_bob(row_sample, label):\n",
    "        import tensorflow as tf\n",
    "\n",
    "        y_t = label[\"Rating\"]\n",
    "        y = tf.expand_dims(\n",
    "            tf.where(\n",
    "                y_t > 3,\n",
    "                tf.ones_like(y_t, dtype=tf.float32),\n",
    "                tf.zeros_like(y_t, dtype=tf.float32),\n",
    "            ),\n",
    "            axis=1,\n",
    "        )\n",
    "        return row_sample, y\n",
    "\n",
    "    def dataset_builder(x):\n",
    "        import pandas as pd\n",
    "        import tensorflow as tf\n",
    "\n",
    "        x = [dict(t) if isinstance(t, pd.DataFrame) else t for t in x]\n",
    "        x = x[0] if len(x) == 1 else tuple(x)\n",
    "        data_set = (\n",
    "            tf.data.Dataset.from_tensor_slices(x).batch(batch_size).repeat(repeat_count)\n",
    "        )\n",
    "\n",
    "        data_set = data_set.map(_parse_bob)\n",
    "\n",
    "        return data_set\n",
    "\n",
    "    return dataset_builder"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3. Construct a databuilder_dict."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_builder_dict = {\n",
    "    alice: create_dataset_builder_alice(\n",
    "        batch_size=128,\n",
    "        repeat_count=5,\n",
    "    ),\n",
    "    bob: create_dataset_builder_bob(\n",
    "        batch_size=128,\n",
    "        repeat_count=5,\n",
    "    ),\n",
    "}"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Define DeepFM Model and run it"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1. Define a DeepFM model. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [],
   "source": [
    "from secretflow.ml.nn.applications.sl_deep_fm import DeepFMbase, DeepFMfuse\n",
    "from secretflow.ml.nn import SLModel\n",
    "\n",
    "NUM_USERS = 6040\n",
    "NUM_MOVIES = 3952\n",
    "GENDER_VOCAB = [\"F\", \"M\"]\n",
    "AGE_VOCAB = [1, 18, 25, 35, 45, 50, 56]\n",
    "OCCUPATION_VOCAB = [i for i in range(21)]\n",
    "GENRES_VOCAB = [\n",
    "    \"Action\",\n",
    "    \"Adventure\",\n",
    "    \"Animation\",\n",
    "    \"Children's\",\n",
    "    \"Comedy\",\n",
    "    \"Crime\",\n",
    "    \"Documentary\",\n",
    "    \"Drama\",\n",
    "    \"Fantasy\",\n",
    "    \"Film-Noir\",\n",
    "    \"Horror\",\n",
    "    \"Musical\",\n",
    "    \"Mystery\",\n",
    "    \"Romance\",\n",
    "    \"Sci-Fi\",\n",
    "    \"Thriller\",\n",
    "    \"War\",\n",
    "    \"Western\",\n",
    "]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. Define Alice's basenet."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [],
   "source": [
    "def create_base_model_alice():\n",
    "    # Create model\n",
    "    def create_model():\n",
    "        import tensorflow as tf\n",
    "\n",
    "        def preprocess():\n",
    "            inputs = {\n",
    "                \"UserID\": tf.keras.Input(shape=(1,), dtype=tf.string),\n",
    "                \"Gender\": tf.keras.Input(shape=(1,), dtype=tf.string),\n",
    "                \"Age\": tf.keras.Input(shape=(1,), dtype=tf.int64),\n",
    "                \"Occupation\": tf.keras.Input(shape=(1,), dtype=tf.int64),\n",
    "            }\n",
    "            user_id_output = tf.keras.layers.Hashing(\n",
    "                num_bins=NUM_USERS, output_mode=\"one_hot\"\n",
    "            )\n",
    "            user_gender_output = tf.keras.layers.StringLookup(\n",
    "                vocabulary=GENDER_VOCAB, output_mode=\"one_hot\"\n",
    "            )\n",
    "\n",
    "            user_age_out = tf.keras.layers.IntegerLookup(\n",
    "                vocabulary=AGE_VOCAB, output_mode=\"one_hot\"\n",
    "            )\n",
    "            user_occupation_out = tf.keras.layers.IntegerLookup(\n",
    "                vocabulary=OCCUPATION_VOCAB, output_mode=\"one_hot\"\n",
    "            )\n",
    "\n",
    "            outputs = {\n",
    "                \"UserID\": user_id_output(inputs[\"UserID\"]),\n",
    "                \"Gender\": user_gender_output(inputs[\"Gender\"]),\n",
    "                \"Age\": user_age_out(inputs[\"Age\"]),\n",
    "                \"Occupation\": user_occupation_out(inputs[\"Occupation\"]),\n",
    "            }\n",
    "            return tf.keras.Model(inputs=inputs, outputs=outputs)\n",
    "\n",
    "        preprocess_layer = preprocess()\n",
    "        model = DeepFMbase(\n",
    "            dnn_units_size=[256, 32],\n",
    "            preprocess_layer=preprocess_layer,\n",
    "        )\n",
    "        model.compile(\n",
    "            loss=tf.keras.losses.binary_crossentropy,\n",
    "            optimizer=tf.keras.optimizers.Adam(),\n",
    "            metrics=[\n",
    "                tf.keras.metrics.AUC(),\n",
    "                tf.keras.metrics.Precision(),\n",
    "                tf.keras.metrics.Recall(),\n",
    "            ],\n",
    "        )\n",
    "        return model  # need wrap\n",
    "\n",
    "    return create_model"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3. Define Bob's basenet and fusenet. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [],
   "source": [
    "# bob model\n",
    "def create_base_model_bob():\n",
    "    # Create model\n",
    "    def create_model():\n",
    "        import tensorflow as tf\n",
    "\n",
    "        # define preprocess layer\n",
    "        def preprocess():\n",
    "            inputs = {\n",
    "                \"MovieID\": tf.keras.Input(shape=(1,), dtype=tf.string),\n",
    "                \"Genres\": tf.keras.Input(shape=(1,), dtype=tf.string),\n",
    "            }\n",
    "\n",
    "            movie_id_out = tf.keras.layers.Hashing(\n",
    "                num_bins=NUM_MOVIES, output_mode=\"one_hot\"\n",
    "            )\n",
    "            movie_genres_out = tf.keras.layers.TextVectorization(\n",
    "                output_mode='multi_hot', split=\"whitespace\", vocabulary=GENRES_VOCAB\n",
    "            )\n",
    "            outputs = {\n",
    "                \"MovieID\": movie_id_out(inputs[\"MovieID\"]),\n",
    "                \"Genres\": movie_genres_out(inputs[\"Genres\"]),\n",
    "            }\n",
    "            return tf.keras.Model(inputs=inputs, outputs=outputs)\n",
    "\n",
    "        preprocess_layer = preprocess()\n",
    "\n",
    "        model = DeepFMbase(\n",
    "            dnn_units_size=[256, 32],\n",
    "            preprocess_layer=preprocess_layer,\n",
    "        )\n",
    "        model.compile(\n",
    "            loss=tf.keras.losses.binary_crossentropy,\n",
    "            optimizer=tf.keras.optimizers.Adam(),\n",
    "            metrics=[\n",
    "                tf.keras.metrics.AUC(),\n",
    "                tf.keras.metrics.Precision(),\n",
    "                tf.keras.metrics.Recall(),\n",
    "            ],\n",
    "        )\n",
    "        return model  # need wrap\n",
    "\n",
    "    return create_model\n",
    "\n",
    "\n",
    "def create_fuse_model():\n",
    "    # Create model\n",
    "    def create_model():\n",
    "        import tensorflow as tf\n",
    "\n",
    "        model = DeepFMfuse(dnn_units_size=[256, 256, 32])\n",
    "        model.compile(\n",
    "            loss=tf.keras.losses.binary_crossentropy,\n",
    "            optimizer=tf.keras.optimizers.Adam(),\n",
    "            metrics=[\n",
    "                tf.keras.metrics.AUC(),\n",
    "                tf.keras.metrics.Precision(),\n",
    "                tf.keras.metrics.Recall(),\n",
    "            ],\n",
    "        )\n",
    "        return model\n",
    "\n",
    "    return create_model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [],
   "source": [
    "base_model_dict = {alice: create_base_model_alice(), bob: create_base_model_bob()}\n",
    "model_fuse = create_fuse_model()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "INFO:root:Create proxy actor <class 'secretflow.ml.nn.sl.backend.tensorflow.sl_base.PYUSLTFModel'> with party alice.\n",
      "INFO:root:Create proxy actor <class 'secretflow.ml.nn.sl.backend.tensorflow.sl_base.PYUSLTFModel'> with party bob.\n",
      "INFO:root:SL Train Params: {'self': <secretflow.ml.nn.sl.sl_model.SLModel object at 0x7fc9c99f51c0>, 'x': VDataFrame(partitions={PYURuntime(alice): Partition(data=<secretflow.device.device.pyu.PYUObject object at 0x7fc9c9998160>), PYURuntime(bob): Partition(data=<secretflow.device.device.pyu.PYUObject object at 0x7fc9c998c520>)}, aligned=True), 'y': VDataFrame(partitions={PYURuntime(bob): Partition(data=<secretflow.device.device.pyu.PYUObject object at 0x7fc9c99984f0>)}, aligned=True), 'batch_size': 128, 'epochs': 5, 'verbose': 1, 'callbacks': None, 'validation_data': None, 'shuffle': False, 'sample_weight': None, 'validation_freq': 1, 'dp_spent_step_freq': None, 'dataset_builder': {PYURuntime(alice): <function create_dataset_builder_alice.<locals>.dataset_builder at 0x7fcb640b9700>, PYURuntime(bob): <function create_dataset_builder_bob.<locals>.dataset_builder at 0x7fca2c6b30d0>}, 'audit_log_dir': None, 'audit_log_params': {}, 'random_seed': 1234}\n",
      "  0%|                                                                                             | 0/8 [00:00<?, ?it/s]"
     ]
    }
   ],
   "source": [
    "from secretflow.data.vertical import read_csv as v_read_csv\n",
    "\n",
    "vdf = v_read_csv(\n",
    "    {alice: \"alice_ml1m.csv\", bob: \"bob_ml1m.csv\"}, keys=\"ID\", drop_keys=\"ID\"\n",
    ")\n",
    "label = vdf[\"Rating\"]\n",
    "\n",
    "data = vdf.drop(columns=[\"Rating\", \"Timestamp\", \"Title\", \"Zip-code\"])\n",
    "data[\"UserID\"] = data[\"UserID\"].astype(\"string\")\n",
    "data[\"MovieID\"] = data[\"MovieID\"].astype(\"string\")\n",
    "\n",
    "sl_model = SLModel(\n",
    "    base_model_dict=base_model_dict,\n",
    "    device_y=bob,\n",
    "    model_fuse=model_fuse,\n",
    ")\n",
    "history = sl_model.fit(\n",
    "    data,\n",
    "    label,\n",
    "    epochs=5,\n",
    "    batch_size=128,\n",
    "    random_seed=1234,\n",
    "    dataset_builder=data_builder_dict,\n",
    ")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python (sf)",
   "language": "python",
   "name": "sf"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}