{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": { "collapsed": true, "pycharm": { "name": "#%% md\n" } }, "source": [ "# Customize DataBuilder on SLModel in SecretFlow" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The following code is for demonstration only. It is **NOT for production** because of system security concerns; please **DO NOT** use it directly in production. " ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "SLModel in SecretFlow supports federated learning in vertical scenarios. This tutorial uses a DeepFM recommendation model as an example to show how to customize a DataBuilder on SLModel. \n", "\n", "The goal is to train DeepFM end to end, with the focus on how the custom DataBuilder is defined and used. \n", "\n", "*For more on the DeepFM model and its split plan, please refer to the DeepFM documentation.* " ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Environment Setting" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "pycharm": { "name": "#%%\n" }, "tags": [] }, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Since the GPL-licensed package `unidecode` is not installed, using Python's `unicodedata` package which yields worse results.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "The version of SecretFlow: 0.8.3b0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2023-07-03 20:12:02,535\tINFO worker.py:1538 -- Started a local Ray instance.\n" ] } ], "source": [ "import secretflow as sf\n", "\n", "# Check the version of your SecretFlow\n", "print('The version of SecretFlow: 
{}'.format(sf.__version__))\n", "\n", "# In case you have a running secretflow runtime already.\n", "sf.shutdown()\n", "sf.init(['alice', 'bob', 'charlie'], address=\"local\", log_to_driver=False)\n", "alice, bob, charlie = sf.PYU('alice'), sf.PYU('bob'), sf.PYU('charlie')" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Datasets" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "We use the classic [MovieLens](https://secretflow-data.oss-accelerate.aliyuncs.com/datasets/movielens/ml-1m.zip) dataset as the example. MovieLens is an open recommendation dataset that contains movie ratings and movie metadata. \n", "\n", "The data is split vertically between two parties: \n", "\n", "- alice: \"UserID\", \"Gender\", \"Age\", \"Occupation\", \"Zip-code\" \n", "- bob: \"MovieID\", \"Rating\", \"Title\", \"Genres\", \"Timestamp\" \n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%%\n" } }, "source": [ "## Define DataBuilders" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "A custom DataBuilder serves customization needs that the existing FedDataFrame and FedNdarray do not cover. \n", "\n", "In Split Learning (where the data of all parties must be aligned), DataBuilder currently supports CSV data only. In a custom DataBuilder you define how each row of data is processed, and how SLModel handles the data depends on that definition. \n", "\n", "Points to note: \n", "\n", "- The MovieLens data is read in CSV format but must be treated as sparse features, so the DataBuilder converts the columns into a dictionary keyed by column name. \n", "- Bob's rating is binarized with a threshold (ratings above 3 become 1, all others 0). 
\n", "- Custom functions can be defined in the DataBuilder and applied to every element of the dataset with `dataset.map`. \n", "- Only CSV format is supported in the vertical scenario, because the data of both sides must stay aligned. \n", "- In CSV mode the builder returns a Dataset with `batch_size` and `repeat` already set, so that SLModel can infer `steps_per_epoch` from the Dataset. \n", "\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### 1. Download the dataset and convert it to the CSV format." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "%%capture\n", "%%!\n", "wget https://secretflow-data.oss-accelerate.aliyuncs.com/datasets/movielens/ml-1m.zip\n", "unzip ./ml-1m.zip " ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Read the `.dat` files and convert each one into a dictionary. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def load_data(filename, columns):\n", "    data = {}\n", "    with open(filename, \"r\", encoding=\"unicode_escape\") as f:\n", "        for line in f:\n", "            ls = line.strip(\"\\n\").split(\"::\")\n", "            data[ls[0]] = dict(zip(columns[1:], ls[1:]))\n", "    return data" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "scrolled": true }, "outputs": [], "source": [ "users_data = load_data(\n", "    \"./ml-1m/users.dat\",\n", "    columns=[\"UserID\", \"Gender\", \"Age\", \"Occupation\", \"Zip-code\"],\n", ")\n", "movies_data = load_data(\"./ml-1m/movies.dat\", columns=[\"MovieID\", \"Title\", \"Genres\"])\n", "ratings_columns = [\"UserID\", \"MovieID\", \"Rating\", \"Timestamp\"]\n", "\n", "rating_data = load_data(\"./ml-1m/ratings.dat\", columns=ratings_columns)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "6040\n", "3883\n", "6040\n" ] } ], "source": [ 
"print(len(users_data))\n", "print(len(movies_data))\n", "print(len(rating_data))" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Next, we join user, movie and rating, split up and assemble them into 'alice_ml1m.csv' and 'bob_ml1m.csv'. " ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "fed_csv = {alice: \"alice_ml1m.csv\", bob: \"bob_ml1m.csv\"}\n", "csv_writer_container = {alice: open(fed_csv[alice], \"w\"), bob: open(fed_csv[bob], \"w\")}\n", "part_columns = {\n", " alice: [\"UserID\", \"Gender\", \"Age\", \"Occupation\", \"Zip-code\"],\n", " bob: [\"MovieID\", \"Rating\", \"Title\", \"Genres\", \"Timestamp\"],\n", "}" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "for device, writer in csv_writer_container.items():\n", " writer.write(\"ID,\" + \",\".join(part_columns[device]) + \"\\n\")" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "f = open(\"ml-1m/ratings.dat\", \"r\", encoding=\"unicode_escape\")\n", "\n", "\n", "def _parse_example(feature, columns, index):\n", " if \"Title\" in feature.keys():\n", " feature[\"Title\"] = feature[\"Title\"].replace(\",\", \"_\")\n", " if \"Genres\" in feature.keys():\n", " feature[\"Genres\"] = feature[\"Genres\"].replace(\"|\", \" \")\n", " values = []\n", " values.append(str(index))\n", " for c in columns:\n", " values.append(feature[c])\n", " return \",\".join(values)\n", "\n", "\n", "index = 0\n", "num_sample = 1000\n", "for line in f:\n", " ls = line.strip().split(\"::\")\n", " rating = dict(zip(ratings_columns, ls))\n", " rating.update(users_data.get(ls[0]))\n", " rating.update(movies_data.get(ls[1]))\n", " for device, columns in part_columns.items():\n", " parse_f = _parse_example(rating, columns, index)\n", " csv_writer_container[device].write(parse_f + \"\\n\")\n", " index += 1\n", " if num_sample > 0 and index >= num_sample:\n", " break\n", 
"for w in csv_writer_container.values():\n", " w.close()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## So we've done the data processing and splitting up by now. \n", "\n", "And we output two files: \n", "```\n", "Alice: alice_ml1m.csv and Bob: bob_ml1m.csv \n", "```" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ID,UserID,Gender,Age,Occupation,Zip-code\r\n", "0,1,F,1,10,48067\r\n", "1,1,F,1,10,48067\r\n", "2,1,F,1,10,48067\r\n", "3,1,F,1,10,48067\r\n", "4,1,F,1,10,48067\r\n", "5,1,F,1,10,48067\r\n", "6,1,F,1,10,48067\r\n", "7,1,F,1,10,48067\r\n", "8,1,F,1,10,48067\r\n" ] } ], "source": [ "! head alice_ml1m.csv" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ID,MovieID,Rating,Title,Genres,Timestamp\r\n", "0,1193,5,One Flew Over the Cuckoo's Nest (1975),Drama,978300760\r\n", "1,661,3,James and the Giant Peach (1996),Animation Children's Musical,978302109\r\n", "2,914,3,My Fair Lady (1964),Musical Romance,978301968\r\n", "3,3408,4,Erin Brockovich (2000),Drama,978300275\r\n", "4,2355,5,Bug's Life_ A (1998),Animation Children's Comedy,978824291\r\n", "5,1197,3,Princess Bride_ The (1987),Action Adventure Comedy Romance,978302268\r\n", "6,1287,5,Ben-Hur (1959),Action Adventure Drama,978302039\r\n", "7,2804,5,Christmas Story_ A (1983),Comedy Drama,978300719\r\n", "8,594,4,Snow White and the Seven Dwarfs (1937),Animation Children's Musical,978302268\r\n" ] } ], "source": [ "! head bob_ml1m.csv" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### 1. Use plaintext engines to develop Databuilders. \n", "\n", "Because the data for each side of the SLModel is different, DataBuilder needs to be developed separately. 
" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### Develop Alice's DataBuilder" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "alice_df = pd.read_csv(\"alice_ml1m.csv\", encoding=\"utf-8\")" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
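<table><thead><tr><th></th><th>UserID</th><th>Gender</th><th>Age</th><th>Occupation</th><th>Zip-code</th></tr></thead><tbody>
<tr><th>0</th><td>1</td><td>F</td><td>1</td><td>10</td><td>48067</td></tr>
<tr><th>1</th><td>1</td><td>F</td><td>1</td><td>10</td><td>48067</td></tr>
<tr><th>2</th><td>1</td><td>F</td><td>1</td><td>10</td><td>48067</td></tr>
<tr><th>3</th><td>1</td><td>F</td><td>1</td><td>10</td><td>48067</td></tr>
<tr><th>4</th><td>1</td><td>F</td><td>1</td><td>10</td><td>48067</td></tr>
<tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>
<tr><th>995</th><td>10</td><td>F</td><td>35</td><td>1</td><td>95370</td></tr>
<tr><th>996</th><td>10</td><td>F</td><td>35</td><td>1</td><td>95370</td></tr>
<tr><th>997</th><td>10</td><td>F</td><td>35</td><td>1</td><td>95370</td></tr>
<tr><th>998</th><td>10</td><td>F</td><td>35</td><td>1</td><td>95370</td></tr>
<tr><th>999</th><td>10</td><td>F</td><td>35</td><td>1</td><td>95370</td></tr>
</tbody></table>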
\n", "

1000 rows × 5 columns

\n", "
" ], "text/plain": [ " UserID Gender Age Occupation Zip-code\n", "0 1 F 1 10 48067\n", "1 1 F 1 10 48067\n", "2 1 F 1 10 48067\n", "3 1 F 1 10 48067\n", "4 1 F 1 10 48067\n", ".. ... ... ... ... ...\n", "995 10 F 35 1 95370\n", "996 10 F 35 1 95370\n", "997 10 F 35 1 95370\n", "998 10 F 35 1 95370\n", "999 10 F 35 1 95370\n", "\n", "[1000 rows x 5 columns]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alice_df[\"UserID\"] = alice_df[\"UserID\"].astype(\"string\")\n", "alice_df = alice_df.drop(columns=\"ID\")\n", "alice_df" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2023-07-03 20:23:17.128798: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory\n", "2023-07-03 20:23:19.575681: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory\n", "2023-07-03 20:23:19.576067: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory\n", "2023-07-03 20:23:19.576096: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. 
If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.\n", "2023-07-03 20:23:22.559302: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory\n", "2023-07-03 20:23:22.560651: W tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:265] failed call to cuInit: UNKNOWN ERROR (303)\n" ] } ], "source": [ "import tensorflow as tf\n", "\n", "alice_dict = dict(alice_df)\n", "data_set = tf.data.Dataset.from_tensor_slices(alice_dict).batch(32).repeat(1)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_set" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### Develop Bob's DataBuilder" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
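<table><thead><tr><th></th><th>MovieID</th><th>Rating</th><th>Title</th><th>Genres</th><th>Timestamp</th></tr></thead><tbody>
<tr><th>0</th><td>1193</td><td>5</td><td>One Flew Over the Cuckoo's Nest (1975)</td><td>Drama</td><td>978300760</td></tr>
<tr><th>1</th><td>661</td><td>3</td><td>James and the Giant Peach (1996)</td><td>Animation Children's Musical</td><td>978302109</td></tr>
<tr><th>2</th><td>914</td><td>3</td><td>My Fair Lady (1964)</td><td>Musical Romance</td><td>978301968</td></tr>
<tr><th>3</th><td>3408</td><td>4</td><td>Erin Brockovich (2000)</td><td>Drama</td><td>978300275</td></tr>
<tr><th>4</th><td>2355</td><td>5</td><td>Bug's Life_ A (1998)</td><td>Animation Children's Comedy</td><td>978824291</td></tr>
<tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>
<tr><th>995</th><td>3704</td><td>2</td><td>Mad Max Beyond Thunderdome (1985)</td><td>Action Sci-Fi</td><td>978228364</td></tr>
<tr><th>996</th><td>1020</td><td>3</td><td>Cool Runnings (1993)</td><td>Comedy</td><td>978228726</td></tr>
<tr><th>997</th><td>784</td><td>3</td><td>Cable Guy_ The (1996)</td><td>Comedy</td><td>978230946</td></tr>
<tr><th>998</th><td>858</td><td>3</td><td>Godfather_ The (1972)</td><td>Action Crime Drama</td><td>978224375</td></tr>
<tr><th>999</th><td>1022</td><td>5</td><td>Cinderella (1950)</td><td>Animation Children's Musical</td><td>979775689</td></tr>
</tbody></table>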
\n", "

1000 rows × 5 columns

\n", "
" ], "text/plain": [ " MovieID Rating Title \\\n", "0 1193 5 One Flew Over the Cuckoo's Nest (1975) \n", "1 661 3 James and the Giant Peach (1996) \n", "2 914 3 My Fair Lady (1964) \n", "3 3408 4 Erin Brockovich (2000) \n", "4 2355 5 Bug's Life_ A (1998) \n", ".. ... ... ... \n", "995 3704 2 Mad Max Beyond Thunderdome (1985) \n", "996 1020 3 Cool Runnings (1993) \n", "997 784 3 Cable Guy_ The (1996) \n", "998 858 3 Godfather_ The (1972) \n", "999 1022 5 Cinderella (1950) \n", "\n", " Genres Timestamp \n", "0 Drama 978300760 \n", "1 Animation Children's Musical 978302109 \n", "2 Musical Romance 978301968 \n", "3 Drama 978300275 \n", "4 Animation Children's Comedy 978824291 \n", ".. ... ... \n", "995 Action Sci-Fi 978228364 \n", "996 Comedy 978228726 \n", "997 Comedy 978230946 \n", "998 Action Crime Drama 978224375 \n", "999 Animation Children's Musical 979775689 \n", "\n", "[1000 rows x 5 columns]" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bob_df = pd.read_csv(\"bob_ml1m.csv\", encoding=\"utf-8\")\n", "bob_df = bob_df.drop(columns=\"ID\")\n", "\n", "bob_df[\"MovieID\"] = bob_df[\"MovieID\"].astype(\"string\")\n", "\n", "bob_df" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "label = bob_df[\"Rating\"]\n", "data = bob_df.drop(columns=\"Rating\")\n", "\n", "\n", "def _parse_bob(row_sample, label):\n", "    import tensorflow as tf\n", "\n", "    y_t = label\n", "    y = tf.expand_dims(\n", "        tf.where(\n", "            y_t > 3,\n", "            tf.ones_like(y_t, dtype=tf.float32),\n", "            tf.zeros_like(y_t, dtype=tf.float32),\n", "        ),\n", "        axis=1,\n", "    )\n", "    return row_sample, y" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "bob_dict = tuple([dict(data), label])\n", "data_set = tf.data.Dataset.from_tensor_slices(bob_dict).batch(32).repeat(1)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "data_set = 
data_set.map(_parse_bob)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "({'MovieID': ,\n", " 'Title': ,\n", " 'Genres': ,\n", " 'Timestamp': },\n", " )" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "next(iter(data_set))" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Wrap each party's DataBuilder separately." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "# alice\n", "def create_dataset_builder_alice(\n", "    batch_size=128,\n", "    repeat_count=5,\n", "):\n", "    def dataset_builder(x):\n", "        import pandas as pd\n", "        import tensorflow as tf\n", "\n", "        x = [dict(t) if isinstance(t, pd.DataFrame) else t for t in x]\n", "        x = x[0] if len(x) == 1 else tuple(x)\n", "        data_set = (\n", "            tf.data.Dataset.from_tensor_slices(x).batch(batch_size).repeat(repeat_count)\n", "        )\n", "\n", "        return data_set\n", "\n", "    return dataset_builder\n", "\n", "\n", "# bob\n", "def create_dataset_builder_bob(\n", "    batch_size=128,\n", "    repeat_count=5,\n", "):\n", "    def _parse_bob(row_sample, label):\n", "        import tensorflow as tf\n", "\n", "        y_t = label[\"Rating\"]\n", "        y = tf.expand_dims(\n", "            tf.where(\n", "                y_t > 3,\n", "                tf.ones_like(y_t, dtype=tf.float32),\n", "                tf.zeros_like(y_t, dtype=tf.float32),\n", "            ),\n", "            axis=1,\n", "        )\n", "        return row_sample, y\n", "\n", "    def dataset_builder(x):\n", "        import pandas as pd\n", "        import tensorflow as tf\n", "\n", "        x = [dict(t) if isinstance(t, pd.DataFrame) else t for t in x]\n", "        x = x[0] if len(x) == 1 else tuple(x)\n", "        data_set = (\n", "            tf.data.Dataset.from_tensor_slices(x).batch(batch_size).repeat(repeat_count)\n", "        )\n", "\n", "        data_set = data_set.map(_parse_bob)\n", "\n", "        return data_set\n", "\n", "    return dataset_builder" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### 3. Construct the data_builder_dict." 
] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "data_builder_dict = {\n", " alice: create_dataset_builder_alice(\n", " batch_size=128,\n", " repeat_count=5,\n", " ),\n", " bob: create_dataset_builder_bob(\n", " batch_size=128,\n", " repeat_count=5,\n", " ),\n", "}" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Define DeepFM Model and run it" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### 1. Define a DeepFM model. " ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "from secretflow.ml.nn.applications.sl_deep_fm import DeepFMbase, DeepFMfuse\n", "from secretflow.ml.nn import SLModel\n", "\n", "NUM_USERS = 6040\n", "NUM_MOVIES = 3952\n", "GENDER_VOCAB = [\"F\", \"M\"]\n", "AGE_VOCAB = [1, 18, 25, 35, 45, 50, 56]\n", "OCCUPATION_VOCAB = [i for i in range(21)]\n", "GENRES_VOCAB = [\n", " \"Action\",\n", " \"Adventure\",\n", " \"Animation\",\n", " \"Children's\",\n", " \"Comedy\",\n", " \"Crime\",\n", " \"Documentary\",\n", " \"Drama\",\n", " \"Fantasy\",\n", " \"Film-Noir\",\n", " \"Horror\",\n", " \"Musical\",\n", " \"Mystery\",\n", " \"Romance\",\n", " \"Sci-Fi\",\n", " \"Thriller\",\n", " \"War\",\n", " \"Western\",\n", "]" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Define Alice's basenet." 
] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "def create_base_model_alice():\n", "    # Create model\n", "    def create_model():\n", "        import tensorflow as tf\n", "\n", "        def preprocess():\n", "            inputs = {\n", "                \"UserID\": tf.keras.Input(shape=(1,), dtype=tf.string),\n", "                \"Gender\": tf.keras.Input(shape=(1,), dtype=tf.string),\n", "                \"Age\": tf.keras.Input(shape=(1,), dtype=tf.int64),\n", "                \"Occupation\": tf.keras.Input(shape=(1,), dtype=tf.int64),\n", "            }\n", "            user_id_output = tf.keras.layers.Hashing(\n", "                num_bins=NUM_USERS, output_mode=\"one_hot\"\n", "            )\n", "            user_gender_output = tf.keras.layers.StringLookup(\n", "                vocabulary=GENDER_VOCAB, output_mode=\"one_hot\"\n", "            )\n", "\n", "            user_age_out = tf.keras.layers.IntegerLookup(\n", "                vocabulary=AGE_VOCAB, output_mode=\"one_hot\"\n", "            )\n", "            user_occupation_out = tf.keras.layers.IntegerLookup(\n", "                vocabulary=OCCUPATION_VOCAB, output_mode=\"one_hot\"\n", "            )\n", "\n", "            outputs = {\n", "                \"UserID\": user_id_output(inputs[\"UserID\"]),\n", "                \"Gender\": user_gender_output(inputs[\"Gender\"]),\n", "                \"Age\": user_age_out(inputs[\"Age\"]),\n", "                \"Occupation\": user_occupation_out(inputs[\"Occupation\"]),\n", "            }\n", "            return tf.keras.Model(inputs=inputs, outputs=outputs)\n", "\n", "        preprocess_layer = preprocess()\n", "        model = DeepFMbase(\n", "            dnn_units_size=[256, 32],\n", "            preprocess_layer=preprocess_layer,\n", "        )\n", "        model.compile(\n", "            loss=tf.keras.losses.binary_crossentropy,\n", "            optimizer=tf.keras.optimizers.Adam(),\n", "            metrics=[\n", "                tf.keras.metrics.AUC(),\n", "                tf.keras.metrics.Precision(),\n", "                tf.keras.metrics.Recall(),\n", "            ],\n", "        )\n", "        return model  # need wrap\n", "\n", "    return create_model" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### 3. Define Bob's basenet and fusenet. 
" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "# bob model\n", "def create_base_model_bob():\n", "    # Create model\n", "    def create_model():\n", "        import tensorflow as tf\n", "\n", "        # define preprocess layer\n", "        def preprocess():\n", "            inputs = {\n", "                \"MovieID\": tf.keras.Input(shape=(1,), dtype=tf.string),\n", "                \"Genres\": tf.keras.Input(shape=(1,), dtype=tf.string),\n", "            }\n", "\n", "            movie_id_out = tf.keras.layers.Hashing(\n", "                num_bins=NUM_MOVIES, output_mode=\"one_hot\"\n", "            )\n", "            movie_genres_out = tf.keras.layers.TextVectorization(\n", "                output_mode='multi_hot', split=\"whitespace\", vocabulary=GENRES_VOCAB\n", "            )\n", "            outputs = {\n", "                \"MovieID\": movie_id_out(inputs[\"MovieID\"]),\n", "                \"Genres\": movie_genres_out(inputs[\"Genres\"]),\n", "            }\n", "            return tf.keras.Model(inputs=inputs, outputs=outputs)\n", "\n", "        preprocess_layer = preprocess()\n", "\n", "        model = DeepFMbase(\n", "            dnn_units_size=[256, 32],\n", "            preprocess_layer=preprocess_layer,\n", "        )\n", "        model.compile(\n", "            loss=tf.keras.losses.binary_crossentropy,\n", "            optimizer=tf.keras.optimizers.Adam(),\n", "            metrics=[\n", "                tf.keras.metrics.AUC(),\n", "                tf.keras.metrics.Precision(),\n", "                tf.keras.metrics.Recall(),\n", "            ],\n", "        )\n", "        return model  # need wrap\n", "\n", "    return create_model\n", "\n", "\n", "def create_fuse_model():\n", "    # Create model\n", "    def create_model():\n", "        import tensorflow as tf\n", "\n", "        model = DeepFMfuse(dnn_units_size=[256, 256, 32])\n", "        model.compile(\n", "            loss=tf.keras.losses.binary_crossentropy,\n", "            optimizer=tf.keras.optimizers.Adam(),\n", "            metrics=[\n", "                tf.keras.metrics.AUC(),\n", "                tf.keras.metrics.Precision(),\n", "                tf.keras.metrics.Recall(),\n", "            ],\n", "        )\n", "        return model\n", "\n", "    return create_model" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "base_model_dict = {alice: create_base_model_alice(), bob: 
create_base_model_bob()}\n", "model_fuse = create_fuse_model()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:root:Create proxy actor with party alice.\n", "INFO:root:Create proxy actor with party bob.\n", "INFO:root:SL Train Params: {'self': , 'x': VDataFrame(partitions={PYURuntime(alice): Partition(data=), PYURuntime(bob): Partition(data=)}, aligned=True), 'y': VDataFrame(partitions={PYURuntime(bob): Partition(data=)}, aligned=True), 'batch_size': 128, 'epochs': 5, 'verbose': 1, 'callbacks': None, 'validation_data': None, 'shuffle': False, 'sample_weight': None, 'validation_freq': 1, 'dp_spent_step_freq': None, 'dataset_builder': {PYURuntime(alice): .dataset_builder at 0x7fcb640b9700>, PYURuntime(bob): .dataset_builder at 0x7fca2c6b30d0>}, 'audit_log_dir': None, 'audit_log_params': {}, 'random_seed': 1234}\n", " 0%| | 0/8 [00:00