{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": { "collapsed": true, "pycharm": { "name": "#%% md\n" } }, "source": [ "# Customize DataBuilder on SLModel in SecretFlow" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The following code is for demonstration only. It is **NOT for production** because of system security concerns; please **DO NOT** use it directly in production. " ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "SecretFlow supports federated learning in vertical scenarios through SLModel. This tutorial uses the DeepFM recommendation model as an example to show how to customize a DataBuilder on SLModel. \n", "\n", "The main goal is to show how to customize a DataBuilder on SLModel; training DeepFM is the vehicle for that. \n", "\n", "*To learn more about the DeepFM model and its split plans, please refer to the related DeepFM documentation.* " ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Environment Setting" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "pycharm": { "name": "#%%\n" }, "tags": [] }, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Since the GPL-licensed package `unidecode` is not installed, using Python's `unicodedata` package which yields worse results.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "The version of SecretFlow: 0.8.3b0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2023-07-03 20:12:02,535\tINFO worker.py:1538 -- Started a local Ray instance.\n" ] } ], "source": [ "import secretflow as sf\n", "\n", "# Check the version of your SecretFlow\n", "print('The version of SecretFlow: 
{}'.format(sf.__version__))\n", "\n", "# Shut down any SecretFlow runtime that is already running.\n", "sf.shutdown()\n", "sf.init(['alice', 'bob', 'charlie'], address=\"local\", log_to_driver=False)\n", "alice, bob, charlie = sf.PYU('alice'), sf.PYU('bob'), sf.PYU('charlie')" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Datasets" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "This tutorial uses the classic [MovieLens](https://secretflow-data.oss-accelerate.aliyuncs.com/datasets/movielens/ml-1m.zip) dataset, an open recommender-system dataset that contains movie ratings and movie metadata. \n", "\n", "The data is split between two parties: \n", "\n", "- alice: \"UserID\", \"Gender\", \"Age\", \"Occupation\", \"Zip-code\" \n", "- bob: \"MovieID\", \"Rating\", \"Title\", \"Genres\", \"Timestamp\" \n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%%\n" } }, "source": [ "## Define DataBuilders" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "A custom DataBuilder covers customization needs that the existing FedDataFrame and FedNdarray do not meet. \n", "\n", "In Split Learning (where the data of all parties must be aligned), DataBuilder is currently provided for CSV data only. In the custom DataBuilder you define what to do with each row of data, and what SLModel receives depends on how you define it. \n", "\n", "Points to note: \n", "\n", "- The MovieLens dataset is read in CSV format but must be treated as sparse features, so the DataBuilder converts each column into dictionary form. \n", "- Bob's rating is converted into a binary label using a threshold. 
\n", "- Custom functions can be defined in the DataBuilder and then applied to each column using `dataset.map`. \n", "- Only the CSV format is supported in the vertical scenario, because the data of both parties must be aligned. \n", "- In CSV mode the DataBuilder returns a Dataset with `batch_size` and `repeat` already set, so that SLModel can infer `steps_per_epoch` from the Dataset. \n", "\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### 1. Download the dataset and convert it to CSV format." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "%%capture\n", "%%!\n", "wget https://secretflow-data.oss-accelerate.aliyuncs.com/datasets/movielens/ml-1m.zip\n", "unzip ./ml-1m.zip " ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Read the data from the `.dat` format and convert it to dictionary form. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def load_data(filename, columns):\n", "    # Map each record's first field (its ID) to a dict of the remaining fields.\n", "    data = {}\n", "    with open(filename, \"r\", encoding=\"unicode_escape\") as f:\n", "        for line in f:\n", "            ls = line.strip(\"\\n\").split(\"::\")\n", "            data[ls[0]] = dict(zip(columns[1:], ls[1:]))\n", "    return data" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "scrolled": true }, "outputs": [], "source": [ "users_data = load_data(\n", "    \"./ml-1m/users.dat\",\n", "    columns=[\"UserID\", \"Gender\", \"Age\", \"Occupation\", \"Zip-code\"],\n", ")\n", "movies_data = load_data(\"./ml-1m/movies.dat\", columns=[\"MovieID\", \"Title\", \"Genres\"])\n", "ratings_columns = [\"UserID\", \"MovieID\", \"Rating\", \"Timestamp\"]\n", "\n", "# Keyed by UserID, so only the last rating of each user is kept here.\n", "rating_data = load_data(\"./ml-1m/ratings.dat\", columns=ratings_columns)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "6040\n", "3883\n", "6040\n" ] } ], "source": [ 
"print(len(users_data))\n", "print(len(movies_data))\n", "print(len(rating_data))" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Next, we join the user, movie, and rating records, split the columns by party, and write them to 'alice_ml1m.csv' and 'bob_ml1m.csv'. " ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "fed_csv = {alice: \"alice_ml1m.csv\", bob: \"bob_ml1m.csv\"}\n", "csv_writer_container = {alice: open(fed_csv[alice], \"w\"), bob: open(fed_csv[bob], \"w\")}\n", "part_columns = {\n", "    alice: [\"UserID\", \"Gender\", \"Age\", \"Occupation\", \"Zip-code\"],\n", "    bob: [\"MovieID\", \"Rating\", \"Title\", \"Genres\", \"Timestamp\"],\n", "}" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "for device, writer in csv_writer_container.items():\n", "    writer.write(\"ID,\" + \",\".join(part_columns[device]) + \"\\n\")" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "f = open(\"ml-1m/ratings.dat\", \"r\", encoding=\"unicode_escape\")\n", "\n", "\n", "def _parse_example(feature, columns, index):\n", "    if \"Title\" in feature.keys():\n", "        feature[\"Title\"] = feature[\"Title\"].replace(\",\", \"_\")\n", "    if \"Genres\" in feature.keys():\n", "        feature[\"Genres\"] = feature[\"Genres\"].replace(\"|\", \" \")\n", "    values = []\n", "    values.append(str(index))\n", "    for c in columns:\n", "        values.append(feature[c])\n", "    return \",\".join(values)\n", "\n", "\n", "index = 0\n", "num_sample = 1000\n", "for line in f:\n", "    ls = line.strip().split(\"::\")\n", "    rating = dict(zip(ratings_columns, ls))\n", "    rating.update(users_data.get(ls[0]))\n", "    rating.update(movies_data.get(ls[1]))\n", "    for device, columns in part_columns.items():\n", "        parse_f = _parse_example(rating, columns, index)\n", "        csv_writer_container[device].write(parse_f + \"\\n\")\n", "    index += 1\n", "    if num_sample > 0 and index >= num_sample:\n", "        break\n", 
"for w in csv_writer_container.values():\n", "    w.close()\n", "f.close()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The data has now been processed and split. \n", "\n", "The output consists of two files: \n", "```\n", "Alice: alice_ml1m.csv and Bob: bob_ml1m.csv \n", "```" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ID,UserID,Gender,Age,Occupation,Zip-code\r\n", "0,1,F,1,10,48067\r\n", "1,1,F,1,10,48067\r\n", "2,1,F,1,10,48067\r\n", "3,1,F,1,10,48067\r\n", "4,1,F,1,10,48067\r\n", "5,1,F,1,10,48067\r\n", "6,1,F,1,10,48067\r\n", "7,1,F,1,10,48067\r\n", "8,1,F,1,10,48067\r\n" ] } ], "source": [ "! head alice_ml1m.csv" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ID,MovieID,Rating,Title,Genres,Timestamp\r\n", "0,1193,5,One Flew Over the Cuckoo's Nest (1975),Drama,978300760\r\n", "1,661,3,James and the Giant Peach (1996),Animation Children's Musical,978302109\r\n", "2,914,3,My Fair Lady (1964),Musical Romance,978301968\r\n", "3,3408,4,Erin Brockovich (2000),Drama,978300275\r\n", "4,2355,5,Bug's Life_ A (1998),Animation Children's Comedy,978824291\r\n", "5,1197,3,Princess Bride_ The (1987),Action Adventure Comedy Romance,978302268\r\n", "6,1287,5,Ben-Hur (1959),Action Adventure Drama,978302039\r\n", "7,2804,5,Christmas Story_ A (1983),Comedy Drama,978300719\r\n", "8,594,4,Snow White and the Seven Dwarfs (1937),Animation Children's Musical,978302268\r\n" ] } ], "source": [ "! head bob_ml1m.csv" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Use plaintext engines to develop DataBuilders. \n", "\n", "Because each party of the SLModel holds different data, a DataBuilder needs to be developed separately for each party. 
" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### Develop Alice's DataBuilder" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "alice_df = pd.read_csv(\"alice_ml1m.csv\", encoding=\"utf-8\")" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | UserID | Gender | Age | Occupation | Zip-code |
|---|---|---|---|---|---|
| 0 | 1 | F | 1 | 10 | 48067 |
| 1 | 1 | F | 1 | 10 | 48067 |
| 2 | 1 | F | 1 | 10 | 48067 |
| 3 | 1 | F | 1 | 10 | 48067 |
| 4 | 1 | F | 1 | 10 | 48067 |
| ... | ... | ... | ... | ... | ... |
| 995 | 10 | F | 35 | 1 | 95370 |
| 996 | 10 | F | 35 | 1 | 95370 |
| 997 | 10 | F | 35 | 1 | 95370 |
| 998 | 10 | F | 35 | 1 | 95370 |
| 999 | 10 | F | 35 | 1 | 95370 |

1000 rows × 5 columns
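
Bob's file can be previewed in the same way before his DataBuilder is developed. The sketch below assumes the same `pd.read_csv` call that is used for Alice's file, and it assumes the `ID` column is dropped afterwards, since the previews show five columns rather than the six in the CSV header. To stay runnable without the downloaded dataset, it reads three rows copied from the `head bob_ml1m.csv` output from an in-memory buffer; in the notebook you would pass the path `bob_ml1m.csv` instead.

```python
import io

import pandas as pd

# Three rows copied from the `head bob_ml1m.csv` output; this buffer only
# stands in for the real file so that the snippet is self-contained.
sample_csv = io.StringIO(
    "ID,MovieID,Rating,Title,Genres,Timestamp\n"
    "0,1193,5,One Flew Over the Cuckoo's Nest (1975),Drama,978300760\n"
    "1,661,3,James and the Giant Peach (1996),Animation Children's Musical,978302109\n"
    "2,914,3,My Fair Lady (1964),Musical Romance,978301968\n"
)

# Assumption: mirror Alice's loading step, then drop the row-alignment ID column.
bob_df = pd.read_csv(sample_csv).drop(columns="ID")
print(bob_df)
```
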
| | MovieID | Rating | Title | Genres | Timestamp |
|---|---|---|---|---|---|
| 0 | 1193 | 5 | One Flew Over the Cuckoo's Nest (1975) | Drama | 978300760 |
| 1 | 661 | 3 | James and the Giant Peach (1996) | Animation Children's Musical | 978302109 |
| 2 | 914 | 3 | My Fair Lady (1964) | Musical Romance | 978301968 |
| 3 | 3408 | 4 | Erin Brockovich (2000) | Drama | 978300275 |
| 4 | 2355 | 5 | Bug's Life_ A (1998) | Animation Children's Comedy | 978824291 |
| ... | ... | ... | ... | ... | ... |
| 995 | 3704 | 2 | Mad Max Beyond Thunderdome (1985) | Action Sci-Fi | 978228364 |
| 996 | 1020 | 3 | Cool Runnings (1993) | Comedy | 978228726 |
| 997 | 784 | 3 | Cable Guy_ The (1996) | Comedy | 978230946 |
| 998 | 858 | 3 | Godfather_ The (1972) | Action Crime Drama | 978224375 |
| 999 | 1022 | 5 | Cinderella (1950) | Animation Children's Musical | 979775689 |

1000 rows × 5 columns
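
The notes under "Define DataBuilders" say that each column is converted into dictionary form and that Bob's rating is turned into a binary label with a threshold. The following is a minimal plain-Python sketch of those two steps on rows taken from Bob's table, not the DataBuilder itself. The helper names `columns_to_dict` and `binarize_ratings` are hypothetical, and the threshold of 3 is an assumption, since this section does not state the value.

```python
def columns_to_dict(rows, columns):
    """Turn row-oriented records into {column name: list of values}."""
    return {c: [row[c] for row in rows] for c in columns}


def binarize_ratings(ratings, threshold=3):
    """Label a rating 1.0 if it is above the (assumed) threshold, else 0.0."""
    return [1.0 if r > threshold else 0.0 for r in ratings]


# Rows 0, 1 and 3 of Bob's table, reduced to three columns for brevity.
rows = [
    {"MovieID": 1193, "Rating": 5, "Genres": "Drama"},
    {"MovieID": 661, "Rating": 3, "Genres": "Animation Children's Musical"},
    {"MovieID": 3408, "Rating": 4, "Genres": "Drama"},
]

features = columns_to_dict(rows, ["MovieID", "Genres"])
labels = binarize_ratings([row["Rating"] for row in rows])

print(features)
print(labels)
```

Keeping the features as a dictionary keyed by column name lets each sparse column be handled separately, while the label stays a plain list aligned with the rows.
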
\n", "