Using Custom DataBuilder in SecretFlow (TensorFlow)#

The code below is for demonstration only. For security reasons, do NOT use it directly in production.

In this tutorial, we show how to load data and train a model with a custom DataBuilder in SecretFlow's multi-party secure environment. We use an image classification task on the Flower dataset to demonstrate how a custom DataBuilder fits into federated learning.

Environment Setting#

[1]:

%load_ext autoreload
%autoreload 2

[2]:

import secretflow as sf

# Check the version of your SecretFlow
print('The version of SecretFlow: {}'.format(sf.__version__))

# In case you have a running secretflow runtime already.
sf.shutdown()
sf.init(['alice', 'bob', 'charlie'], address="local", log_to_driver=False)
alice, bob, charlie = sf.PYU('alice'), sf.PYU('bob'), sf.PYU('charlie')

2023-04-17 15:18:33,602 INFO worker.py:1538 -- Started a local Ray instance.

Interface Introduction#

SecretFlow's FLModel supports reading data through a custom DataBuilder, so users can handle data input flexibly according to their needs. The example below demonstrates how to use a custom DataBuilder for federated model training.

Steps to use DataBuilder:

1. Use a single-machine engine (TensorFlow or PyTorch) to develop a builder function that produces the Dataset.
2. Wrap each party's builder function in a create_dataset_builder function. Note: the resulting dataset_builder must accept a stage parameter.
3. Build the data_builder_dict, which maps each PYU to its dataset_builder.
4. Pass the data_builder_dict to the dataset_builder parameter of the fit function. The x argument of fit then carries whatever input the dataset_builder expects (e.g., in this example, the path of the image folder).
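
The wrapping pattern in steps 1 to 3 can be sketched without any ML framework. The snippet below is a minimal stand-in, not SecretFlow code: create_path_batch_builder is a hypothetical name, and it batches file paths instead of decoded images. It only illustrates the contract, namely a closure that takes an input and a stage and returns a dataset plus steps_per_epoch.

```python
import math
import os


def create_path_batch_builder(batch_size=32, validation_split=0.2):
    """Hypothetical stand-in: returns a builder that batches file paths."""

    def dataset_builder(folder_path, stage="train"):
        # Collect every file under folder_path in a deterministic order.
        paths = sorted(
            os.path.join(root, name)
            for root, _, names in os.walk(folder_path)
            for name in names
        )
        n_eval = int(len(paths) * validation_split)
        if stage == "train":
            selected = paths[: len(paths) - n_eval]
        elif stage == "eval":
            selected = paths[len(paths) - n_eval:]
        else:
            raise ValueError(f"Unsupported stage: {stage!r}")
        # Stand-in "dataset": a list of batches of file paths.
        batches = [
            selected[i: i + batch_size]
            for i in range(0, len(selected), batch_size)
        ]
        steps_per_epoch = math.ceil(len(selected) / batch_size)
        return batches, steps_per_epoch

    return dataset_builder
```

Because batch_size is captured by the closure, each party can receive a builder configured up front, while the per-party input (the folder path) is supplied later at call time.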

Using DataBuilder in FLModel requires a predefined data_builder_dict. Each builder must return a tf.data.Dataset together with steps_per_epoch, and the steps_per_epoch returned by all parties must be the same.

data_builder_dict = {
    alice: create_alice_dataset_builder(
        batch_size=32,
    ),  # the builder created here must return (Dataset, steps_per_epoch)
    bob: create_bob_dataset_builder(
        batch_size=32,
    ),  # the builder created here must return (Dataset, steps_per_epoch)
}
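
Since every party must report the same steps_per_epoch, it can help to check this before training. The following is a minimal sketch under stated assumptions: check_steps_consistent is a hypothetical helper (not part of SecretFlow), parties are represented by plain strings, and steps are computed as ceil(samples / batch_size).

```python
import math


def steps_per_epoch(num_samples, batch_size):
    """Number of batches needed to cover num_samples once."""
    return math.ceil(num_samples / batch_size)


def check_steps_consistent(samples_per_party, batch_size):
    """Return the shared steps_per_epoch, or raise if parties disagree."""
    if not samples_per_party:
        raise ValueError("samples_per_party is empty")
    steps = {
        party: steps_per_epoch(n, batch_size)
        for party, n in samples_per_party.items()
    }
    if len(set(steps.values())) > 1:
        raise ValueError(f"steps_per_epoch differs across parties: {steps}")
    return next(iter(steps.values()))


# Different sample counts can still yield the same number of steps.
print(check_steps_consistent({"alice": 100, "bob": 120}, batch_size=32))  # 4
```

If the parties hold very different amounts of data, one option is to agree on a fixed steps_per_epoch and have each builder return that value.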

Download Data#

Flower Dataset Introduction: The Flower dataset consists of 4323 color images of 5 different types of flowers (daisy, dandelion, rose, sunflower, and tulip). Each flower has images from multiple angles and different lighting conditions, and the resolution of each image is 320x240. This dataset is commonly used for training and testing of image classification and machine learning algorithms. The number of each category in the dataset is as follows: daisy (633), dandelion (898), rose (641), sunflower (699), and tulip (852).

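Download link (original TensorFlow archive): http://download.tensorflow.org/example_images/flower_photos.tgz. The cell below downloads flower_photos.tgz from a SecretFlow-hosted mirror instead.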

Download Data and Unzip#

[3]:

import tempfile
import tensorflow as tf

_temp_dir = tempfile.mkdtemp()
path_to_flower_dataset = tf.keras.utils.get_file(
    "flower_photos",
    "https://secretflow-data.oss-accelerate.aliyuncs.com/datasets/tf_flowers/flower_photos.tgz",
    untar=True,
    cache_dir=_temp_dir,
)

Downloading data from https://secretflow-data.oss-accelerate.aliyuncs.com/datasets/tf_flowers/flower_photos.tgz
67588319/67588319 [==============================] - 1s 0us/step
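
To see how many images each class contains after extraction, a small helper like the one below can be used. It is a sketch that assumes the archive unpacks into one subfolder per class, which is the layout image_dataset_from_directory expects; count_files_per_class is a hypothetical name.

```python
from pathlib import Path


def count_files_per_class(dataset_dir):
    """Map each class subfolder name to the number of files directly inside it."""
    counts = {}
    for class_dir in sorted(Path(dataset_dir).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1 for p in class_dir.iterdir() if p.is_file()
            )
    return counts
```

For example, count_files_per_class(path_to_flower_dataset) would return a dict keyed by flower type for the folder downloaded above.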

Next, let's build a custom DataBuilder.

1. Develop DataBuilder with single-machine version engine#

When developing a DataBuilder, we can follow the same logic as single-machine development. The goal is to build a tf.data.Dataset object.

[4]:

import math
import tensorflow as tf

img_height = 180
img_width = 180
batch_size = 32
# In this example, we use the TensorFlow interface for development.
data_set = tf.keras.utils.image_dataset_from_directory(
    path_to_flower_dataset,
    validation_split=0.2,
    subset="both",
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size,
)

Found 436 files belonging to 5 classes.
Using 349 files for training.
Using 87 files for validation.

2023-04-10 13:16:34.492390: E tensorflow/stream_executor/cuda/cuda_driver.cc:265] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected

[5]:

train_set = data_set[0]
test_set = data_set[1]

[6]:

print(type(train_set), type(test_set))

<class 'tensorflow.python.data.ops.dataset_ops.BatchDataset'> <class 'tensorflow.python.data.ops.dataset_ops.BatchDataset'>

[7]:

x, y = next(iter(train_set))
print(f"x.shape = {x.shape}")
print(f"y.shape = {y.shape}")

x.shape = (32, 180, 180, 3)
y.shape = (32,)

2. Wrap the developed DataBuilder#

The DataBuilder we developed has to be shipped to each party's machine and executed there, so we wrap it in a function that can be serialized. Note: FLModel requires the DataBuilder to return two results: (data_set, steps_per_epoch).

[8]:

def create_dataset_builder(
    batch_size=32,
):
    def dataset_builder(folder_path, stage="train"):
        import math

        import tensorflow as tf

        img_height = 180
        img_width = 180
        data_set = tf.keras.utils.image_dataset_from_directory(
            folder_path,
            validation_split=0.2,
            subset="both",
            seed=123,
            image_size=(img_height, img_width),
            batch_size=batch_size,
        )
        if stage == "train":
            train_dataset = data_set[0]
            train_step_per_epoch = math.ceil(len(data_set[0].file_paths) / batch_size)
            return train_dataset, train_step_per_epoch
        elif stage == "eval":
            eval_dataset = data_set[1]
            eval_step_per_epoch = math.ceil(len(data_set[1].file_paths) / batch_size)
            return eval_dataset, eval_step_per_epoch
        else:
            # Fail loudly instead of silently returning None for an unknown stage.
            raise ValueError(f"Unsupported stage: {stage!r}; expected 'train' or 'eval'")

    return dataset_builder
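
As a worked example of the math.ceil expression in the builder, the snippet below plugs in the file counts printed in step 1 (349 training files and 87 validation files) with batch_size=32. The last batch of each split is partial, which is why the result is rounded up.

```python
import math

batch_size = 32
train_files, eval_files = 349, 87  # counts printed in step 1

train_steps = math.ceil(train_files / batch_size)  # 349 / 32 = 10.9...
eval_steps = math.ceil(eval_files / batch_size)    # 87 / 32 = 2.7...
print(train_steps, eval_steps)  # 11 3
```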

3. Build data_builder_dict#

In the horizontal scenario, all parties process data with the same logic, so a single wrapped DataBuilder constructor is enough. Next, we build the data_builder_dict.

[9]:

data_builder_dict = {
    alice: create_dataset_builder(
        batch_size=32,
    ),
    bob: create_dataset_builder(
        batch_size=32,
    ),
}
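
When more parties share the same builder logic, the dict can also be built with a comprehension. The helper below is a hypothetical convenience, not a SecretFlow API; it calls the factory once per party so that each party gets its own builder object, as in the cell above.

```python
def build_builder_dict(parties, builder_factory, **factory_kwargs):
    """Create one builder per party by calling builder_factory for each."""
    return {party: builder_factory(**factory_kwargs) for party in parties}


# With the names from this tutorial, the call would look like:
# data_builder_dict = build_builder_dict(
#     [alice, bob], create_dataset_builder, batch_size=32
# )
```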

4. Pass data_builder_dict to the model#

Next, we define the model and train it with the custom data builders constructed above.

[10]:

def create_conv_flower_model(input_shape, num_classes, name='model'):
    def create_model():
        # Import inside the function so the name is resolved where the model is built.
        from tensorflow import keras

        # Create model
        model = keras.Sequential(
            [
                keras.Input(shape=input_shape),
                keras.layers.Rescaling(1.0 / 255),
                keras.layers.Conv2D(32, 3, activation='relu'),
                keras.layers.MaxPooling2D(),
                keras.layers.Conv2D(32, 3, activation='relu'),
                keras.layers.MaxPooling2D(),
                keras.layers.Conv2D(32, 3, activation='relu'),
                keras.layers.MaxPooling2D(),
                keras.layers.Flatten(),
                keras.layers.Dense(128, activation='relu'),
                keras.layers.Dense(num_classes),
            ]
        )
        # Compile model
        model.compile(
            loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
            optimizer='adam',
            metrics=["accuracy"],
        )
        return model

    return create_model
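
To see how the 180x180 input shrinks through the three Conv2D/MaxPooling2D blocks, the arithmetic can be sketched in plain Python. This assumes the Keras defaults used above: Conv2D with a 3x3 kernel, stride 1 and "valid" padding, and MaxPooling2D with a 2x2 window and stride 2. The helper names are illustrative only.

```python
def conv_valid(size, kernel=3):
    """Output size of a stride-1 convolution with 'valid' padding."""
    return size - kernel + 1


def max_pool(size, pool=2):
    """Output size of non-overlapping max pooling with 'valid' padding."""
    return size // pool


def flattened_features(height, width, filters=32, blocks=3):
    """Number of features entering the first Dense layer."""
    for _ in range(blocks):
        height = max_pool(conv_valid(height))
        width = max_pool(conv_valid(width))
    return height * width * filters


# 180 -> 178 -> 89 -> 87 -> 43 -> 41 -> 20 per spatial dimension
print(flattened_features(180, 180))  # 12800
```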

[11]:

from secretflow.ml.nn import FLModel
from secretflow.security.aggregation import SecureAggregator

[12]:

device_list = [alice, bob]
aggregator = SecureAggregator(charlie, [alice, bob])

# prepare model
num_classes = 5
input_shape = (180, 180, 3)

# keras model
model = create_conv_flower_model(input_shape, num_classes)


fed_model = FLModel(
    device_list=device_list,
    model=model,
    aggregator=aggregator,
    backend="tensorflow",
    strategy="fed_avg_w",
    random_seed=1234,
)

INFO:root:Create proxy actor <class 'secretflow.security.aggregation.secure_aggregator._Masker'> with party alice.
INFO:root:Create proxy actor <class 'secretflow.security.aggregation.secure_aggregator._Masker'> with party bob.
INFO:root:Create proxy actor <class 'secretflow.ml.nn.fl.backend.tensorflow.strategy.fed_avg_w.PYUFedAvgW'> with party alice.
INFO:root:Create proxy actor <class 'secretflow.ml.nn.fl.backend.tensorflow.strategy.fed_avg_w.PYUFedAvgW'> with party bob.
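
For intuition about what aggregation does, the snippet below sketches a sample-weighted average of per-party model weights. It assumes that the "fed_avg_w" strategy averages model weights; it is not SecretFlow's implementation, it works on plain lists of floats, and it leaves out the masking that SecureAggregator adds.

```python
def weighted_average(party_weights, party_sample_counts):
    """Average flat weight lists, weighting each party by its sample count."""
    total = sum(party_sample_counts)
    n_params = len(party_weights[0])
    return [
        sum(w[i] * n for w, n in zip(party_weights, party_sample_counts)) / total
        for i in range(n_params)
    ]


# Two parties, two parameters each; the second party holds 3x more samples.
print(weighted_average([[1.0, 2.0], [3.0, 4.0]], [1, 3]))  # [2.5, 3.5]
```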

Our dataset builder takes the path of the image dataset as input, so the input data here is a dict that maps each party to its folder path.

data = {
    alice: folder_path_of_alice,
    bob: folder_path_of_bob
}
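
Conceptually, each party's builder receives only that party's own entry from this dict. The single-process sketch below illustrates that routing with plain strings as parties; dispatch_builders is a hypothetical helper, and in SecretFlow each builder would instead run on its own party's device.

```python
def dispatch_builders(data, builder_dict, stage="train"):
    """Call each party's builder with that party's own entry from data."""
    missing = set(builder_dict) - set(data)
    if missing:
        names = sorted(map(str, missing))
        raise KeyError(f"no input provided for parties: {names}")
    return {
        party: builder(data[party], stage=stage)
        for party, builder in builder_dict.items()
    }
```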

[13]:

data = {
    alice: path_to_flower_dataset,
    bob: path_to_flower_dataset,
}
history = fed_model.fit(
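    data,  # x: per-party input handed to each dataset_builder (here, a folder path)
    None,  # y: not needed, labels come from the dataset built by dataset_builder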
    validation_data=data,
    epochs=5,
    batch_size=32,
    aggregate_freq=2,
    sampler_method="batch",
    random_seed=1234,
    dp_spent_step_freq=1,
    dataset_builder=data_builder_dict,
)

INFO:root:FL Train Params: {'self': <secretflow.ml.nn.fl.fl_model.FLModel object at 0x7f7b7a28b8e0>, 'x': {alice: '../../public_dataset/datasets/flower_photos', bob: '../../public_dataset/datasets/flower_photos'}, 'y': None, 'batch_size': 32, 'batch_sampling_rate': None, 'epochs': 5, 'verbose': 1, 'callbacks': None, 'validation_data': {alice: '../../public_dataset/datasets/flower_photos', bob: '../../public_dataset/datasets/flower_photos'}, 'shuffle': False, 'class_weight': None, 'sample_weight': None, 'validation_freq': 1, 'aggregate_freq': 2, 'label_decoder': None, 'max_batch_size': 20000, 'prefetch_buffer_size': None, 'sampler_method': 'batch', 'random_seed': 1234, 'dp_spent_step_freq': 1, 'audit_log_dir': None, 'dataset_builder': {alice: <function create_dataset_builder.<locals>.dataset_builder at 0x7f7b7a2bb1f0>, bob: <function create_dataset_builder.<locals>.dataset_builder at 0x7f7b7a2bb0d0>}}
32it [00:18,  1.71it/s, epoch: 1/5 -  loss:1.5339548587799072  accuracy:0.3142559826374054  val_loss:1.582740068435669  val_accuracy:0.2874999940395355 ]
100%|██████████| 8/8 [00:05<00:00,  1.51it/s, epoch: 2/5 -  loss:1.4520319700241089  accuracy:0.36693549156188965  val_loss:1.3319271802902222  val_accuracy:0.40416666865348816 ]
100%|██████████| 8/8 [00:05<00:00,  1.54it/s, epoch: 3/5 -  loss:1.2720597982406616  accuracy:0.45766130089759827  val_loss:1.3382091522216797  val_accuracy:0.47083333134651184 ]
100%|██████████| 8/8 [00:05<00:00,  1.50it/s, epoch: 4/5 -  loss:1.229131817817688  accuracy:0.5040322542190552  val_loss:1.3033963441848755  val_accuracy:0.4375 ]
100%|██████████| 8/8 [00:05<00:00,  1.59it/s, epoch: 5/5 -  loss:1.3306885957717896  accuracy:0.4301075339317322  val_loss:2.1492652893066406  val_accuracy:0.25833332538604736 ]
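
In the log above, validation accuracy peaks mid-run and drops in the last epoch. The small sketch below picks the best epoch from those logged values; the numbers are copied by hand and rounded from the log, not read through a SecretFlow API.

```python
# val_accuracy per epoch, rounded from the training log above
val_accuracy = [0.2875, 0.4042, 0.4708, 0.4375, 0.2583]

# Epochs are 1-based in the log, list indices are 0-based.
best_epoch = max(range(len(val_accuracy)), key=val_accuracy.__getitem__) + 1
print(best_epoch, val_accuracy[best_epoch - 1])  # 3 0.4708
```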

Next, you can try it with your own dataset.