Merge branch 'admin_big_file' of github.com:yanchengnv/NVFlare into a…
…dmin_big_file
yanchengnv committed Sep 12, 2023
2 parents 9771d36 + f14ebfe commit 5dd6f9e
Showing 189 changed files with 15,088 additions and 904 deletions.
6 changes: 3 additions & 3 deletions CITATION.cff
Original file line number Diff line number Diff line change
@@ -70,7 +70,7 @@ preferred-citation:
- family-names: Feng
given-names: Andrew
doi: "https://doi.org/10.48550/arXiv.2210.13291"
journal: "International Workshop on Federated Learning, NeurIPS 2022, New Orleans, USA"
month: 10
journal: "IEEE Data Eng. Bull., Vol. 46, No. 1"
month: 3
title: "NVIDIA FLARE: Federated Learning from Simulation to Real-World"
year: 2022
year: 2023
7 changes: 6 additions & 1 deletion docs/publications_and_talks.rst
@@ -7,9 +7,14 @@ Publications
Non-exhaustive list of papers and publications related to NVIDIA FLARE,
including papers using NVIDIA FLARE's predecessor libraries included in the `Clara Train SDK <https://catalog.ngc.nvidia.com/orgs/nvidia/containers/clara-train-sdk>`__.

Publications: 2023
------------------
* **2023-10** `ConDistFL: Conditional Distillation for Federated Learning from Partially Annotated Data <https://arxiv.org/abs/2308.04070>`__ (`DeCaF @ MICCAI 2023 <https://decaf-workshop.github.io/decaf-2023/>`__)
* **2023-06** `Fair Federated Medical Image Segmentation via Client Contribution Estimation <https://arxiv.org/abs/2303.16520>`__ (`CVPR 2023 <https://cvpr2023.thecvf.com/Conferences/2023/>`__)
* **2023-03** `FLARE: Federated Learning from Simulation to Real-World <https://arxiv.org/abs/2210.13291>`__ (`IEEE Data Eng. Bull. March 2023, Vol. 46, No. 1 <http://sites.computer.org/debull/A23mar/issue1.htm>`__)

Publications: 2022
------------------
* **2022-12** `FLARE: Federated Learning from Simulation to Real-World <https://arxiv.org/abs/2210.13291>`__ (`International Workshop on Federated Learning, NeurIPS 2022, New Orleans, USA <https://federated-learning.org/fl-neurips-2022>`__)
* **2022-10** `Auto-FedRL: Federated Hyperparameter Optimization for Multi-institutional Medical Image Segmentation <https://arxiv.org/abs/2203.06338>`__ (`ECCV 2022 <https://eccv2022.ecva.net/>`__)
* **2022-10** `Joint Multi Organ and Tumor Segmentation from Partial Labels Using Federated Learning <https://link.springer.com/chapter/10.1007/978-3-031-18523-6_6>`__ (`DeCaF @ MICCAI 2022 <https://decaf-workshop.github.io/decaf-2022/>`__)
* **2022-10** `Split-U-Net: Preventing Data Leakage in Split Learning for Collaborative Multi-modal Brain Tumor Segmentation <https://arxiv.org/abs/2208.10553>`__ (`DeCaF @ MICCAI 2022 <https://decaf-workshop.github.io/decaf-2022/>`__)
3 changes: 3 additions & 0 deletions examples/hello-world/README.md
@@ -18,6 +18,9 @@ Before you run the notebook, the following preparation work must be done:
* [Hello world notebook](./hello_world.ipynb)

## Hello World Examples
### Easier ML/DL to FL transition
* [ML to FL](./ml-to-fl/README.md): Showcases how to convert existing ML/DL code to an NVFlare job.

### Workflows
* [Hello Scatter and Gather](./hello-numpy-sag/README.md)
* Example using "[ScatterAndGather](https://nvflare.readthedocs.io/en/main/apidocs/nvflare.app_common.workflows.scatter_and_gather.html)" controller workflow.
@@ -0,0 +1,71 @@
{
  "format_version": 2,
  "executors": [
    {
      "tasks": [
        "train", "submit_model", "validate"
      ],
      "executor": {
        "path": "nvflare.app_common.ccwf.comps.np_trainer.NPTrainer",
        "args": {}
      }
    },
    {
      "tasks": ["swarm_*"],
      "executor": {
        "path": "nvflare.app_common.ccwf.SwarmClientController",
        "args": {
          "learn_task_name": "train",
          "learn_task_timeout": 5.0,
          "persistor_id": "persistor",
          "aggregator_id": "aggregator",
          "shareable_generator_id": "shareable_generator",
          "min_responses_required": 2,
          "wait_time_after_min_resps_received": 1
        }
      }
    },
    {
      "tasks": ["cse_*"],
      "executor": {
        "path": "nvflare.app_common.ccwf.CrossSiteEvalClientController",
        "args": {
          "submit_model_task_name": "submit_model",
          "validation_task_name": "validate",
          "persistor_id": "persistor"
        }
      }
    }
  ],
  "task_result_filters": [],
  "task_data_filters": [],
  "components": [
    {
      "id": "persistor",
      "path": "nvflare.app_common.ccwf.comps.np_file_model_persistor.NPFileModelPersistor",
      "args": {}
    },
    {
      "id": "shareable_generator",
      "path": "nvflare.app_common.ccwf.comps.simple_model_shareable_generator.SimpleModelShareableGenerator",
      "args": {}
    },
    {
      "id": "aggregator",
      "path": "nvflare.app_common.aggregators.intime_accumulate_model_aggregator.InTimeAccumulateWeightedAggregator",
      "args": {
        "expected_data_kind": "WEIGHT_DIFF"
      }
    },
    {
      "id": "model_selector",
      "path": "nvflare.app_common.ccwf.comps.simple_intime_model_selector.SimpleIntimeModelSelector",
      "args": {}
    },
    {
      "id": "result_printer",
      "path": "nvflare.app_common.ccwf.comps.cwe_result_printer.CWEResultPrinter",
      "args": {}
    }
  ]
}
@@ -0,0 +1,26 @@
{
  "format_version": 2,
  "task_data_filters": [],
  "task_result_filters": [],
  "components": [
    {
      "id": "json_generator",
      "name": "ValidationJsonGenerator",
      "args": {}
    }
  ],
  "workflows": [
    {
      "id": "swarm_controller",
      "path": "nvflare.app_common.ccwf.SwarmServerController",
      "args": {
        "num_rounds": 3
      }
    },
    {
      "id": "cross_site_eval",
      "path": "nvflare.app_common.ccwf.CrossSiteEvalServerController",
      "args": {}
    }
  ]
}
@@ -1,5 +1,5 @@
{
"name": "subprocess with file pipe with pytorch",
"name": "hello-numpy-sw-cse",
"resource_spec": {},
"min_clients" : 2,
"deploy_map": {
201 changes: 188 additions & 13 deletions examples/hello-world/ml-to-fl/README.md
@@ -1,27 +1,202 @@
# ML to FL transition with NVFlare
# Deep Learning to Federated Learning transition with NVFlare

Converting Machine Learning or Deep Learning to FL is not easy, as it involves:
Converting Deep Learning (DL) to Federated Learning (FL) is not easy, as it involves:

1. Algorithm formulation: how to formulate an ML/DL algorithm as an FL algorithm, and what information needs to be passed between Client and Server
1. Algorithm formulation: how to formulate a DL algorithm as an FL algorithm, and what information needs to be passed between Client and Server

2. Convert existing standalone, centralized ML/DL code to FL code.
2. Convert existing standalone, centralized DL code to FL code.

3. Configure the workflow to use the newly changed code.

In this example, we assume #1 algorithm formulation is fixed (FedAvg).
We are showing #2, that is how to quickly convert the centralized DL to FL.
In this example, we assume algorithm formulation is fixed (FedAvg).
We are showing how to quickly convert the centralized DL to FL.
We will demonstrate different techniques depending on the existing code structure and preferences.
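
For readers new to FedAvg, the server-side aggregation that this example takes as fixed can be sketched in plain Python. This is a minimal illustration under stated assumptions, not NVFlare's aggregator: the helper name `fedavg`, the `Weights` alias, and the use of dictionaries of float lists as "model weights" are all hypothetical, and each client is weighted by its number of local training samples.

```python
from typing import Dict, List, Tuple

Weights = Dict[str, List[float]]


def fedavg(updates: List[Tuple[Weights, int]]) -> Weights:
    """Average client weights, weighting each client by its sample count.

    `updates` holds one (weights, num_samples) pair per client.
    Hypothetical helper for illustration only; not an NVFlare API.
    """
    total = sum(n for _, n in updates)
    if total <= 0:
        raise ValueError("total number of samples must be positive")

    first_weights = updates[0][0]
    averaged: Weights = {}
    for name, values in first_weights.items():
        acc = [0.0] * len(values)
        for weights, n in updates:
            for i, v in enumerate(weights[name]):
                acc[i] += v * n / total
        averaged[name] = acc
    return averaged


# Two clients: site-1 trained on 100 samples, site-2 on 300.
site_1 = ({"fc.weight": [1.0, 2.0]}, 100)
site_2 = ({"fc.weight": [3.0, 6.0]}, 300)
global_weights = fedavg([site_1, site_2])
print(global_weights)  # {'fc.weight': [2.5, 5.0]}
```

The point of the sketch is only that each client sends back weights plus enough metadata (here, a sample count) for the server to combine them; the conversion techniques below are about producing that payload from existing training code.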

For #3 one can refer to the config we have here and the documentation.
To configure the workflow, one can refer to the config we have here and the documentation.

In this directory, we are providing job configurations to showcase how to utilize
`LauncherExecutor`, `Launcher` and several NVFlare interfaces to simplify the
transition from your ML code to FL with NVFlare.
transition from your DL code to FL with NVFlare.

We will demonstrate how to transform an existing DL code into FL application step-by-step:

## Examples
1. Show a base line training script [the base line](#the-base-line)
2. How to modify a non-structured script using DL2FL Client API [the Client API usage example](#transform-cifar10-dl-training-code-to-fl-including-best-model-selection-using-client-api)
3. How to modify a structured script using DL2FL decorator [the decorator usage example](#the-decorator-use-case)
4. How to modify a structured "lightning" script using DL2FL Lightning Client API [the lightning use case](#transform-cifar10-lightning-training-code-to-fl-with-nvflare-client-lightning-integration-api)

- [client_api1](./jobs/client_api1/): Re-write CIFAR10 PyTorch example to federated learning job using NVFlare client API
- [client_api2](./jobs/client_api2/): Re-write CIFAR10 PyTorch example to federated learning job using NVFlare client API with model selection
- [decorator](./jobs/decorator/): Re-write CIFAR10 PyTorch example to federated learning job using NVFlare client API decorator
- [lightning_client_api](./jobs/lightning_client_api/): Re-write PyTorch Lightning auto-encoder example to federated learning job using NVFlare lightning client API
## The base line

We take a CIFAR10 example directly from [PyTorch website](https://github.com/pytorch/tutorials/blob/main/beginner_source/blitz/cifar10_tutorial.py) and do the following cleanups to get [cifar10_tutorial_clean.py](./codes/cifar10_tutorial_clean.py):

1. Remove the comments
2. Move the definition of Convolutional Neural Network to [net.py](./codes/net.py)
3. Wrap the whole code inside a main method (https://docs.python.org/3/library/multiprocessing.html#the-spawn-and-forkserver-start-methods)
4. Add the ability to run on GPU to speed up the training process (optional)
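
Step 3 matters because, under the `spawn` and `forkserver` start methods, child processes (for example data-loading workers) import the main module again, so training code left at module top level could run once more in each child. A minimal sketch of the resulting layout, with a toy `train` function standing in for the real training loop (the names and the fake loss values are assumptions for illustration, not the contents of `cifar10_tutorial_clean.py`):

```python
def train(num_epochs: int) -> list:
    """Toy stand-in for the training loop; returns one fake loss per epoch."""
    losses = []
    for epoch in range(num_epochs):
        losses.append(1.0 / (epoch + 1))
    return losses


def main() -> list:
    # Everything that should run exactly once (data loading, training,
    # evaluation) lives here instead of at module top level.
    losses = train(num_epochs=2)
    print("Finished Training")
    return losses


if __name__ == "__main__":
    # The guard keeps a re-import of this module from starting training again.
    main()
```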

You can run the baseline using

```bash
python3 ./codes/cifar10_tutorial_clean.py
```

It will run for 2 epochs and you will see something like:

```bash
Extracting ./data/cifar-10-python.tar.gz to ./data
Files already downloaded and verified
[1, 2000] loss: 2.127
[1, 4000] loss: 1.826
[1, 6000] loss: 1.667
[1, 8000] loss: 1.568
[1, 10000] loss: 1.503
[1, 12000] loss: 1.455
[2, 2000] loss: 1.386
[2, 4000] loss: 1.362
[2, 6000] loss: 1.348
[2, 8000] loss: 1.329
[2, 10000] loss: 1.327
[2, 12000] loss: 1.275
Finished Training
Accuracy of the network on the 10000 test images: 55 %
```

## Transform CIFAR10 DL training code to FL including best model selection using Client API

Now that we have CIFAR10 DL training code, let's transform it to FL with the NVFlare Client API.


We make the following changes:

1. Import NVFlare Client API: ```import nvflare.client as flare```
2. Initialize NVFlare Client API: ```flare.init()```
3. Receive aggregated/global FLModel from NVFlare side: ```input_model = flare.receive()```
4. Load the received aggregated/global model weights into model structure: ```net.load_state_dict(input_model.params)```
5. Wrap evaluation logic into a method to re-use for evaluation on both trained and received aggregated/global model
6. Evaluate on received aggregated/global model to get the metrics for model selection
7. Construct the FLModel to be returned to the NVFlare side: ```output_model = flare.FLModel(xxx)```
8. Send the model back to NVFlare: ```flare.send(output_model)```
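
The order of these steps can be sketched without an NVFlare installation. In the block below, `StubFlare` and the local `FLModel` dataclass are hypothetical in-memory stand-ins that only mimic the `init` / `receive` / `send` call sequence listed above; the `metrics` field, the toy `evaluate` and `local_train` functions, and the single-float "model" are assumptions made for illustration. The real script is [./codes/cifar10_client_api.py](./codes/cifar10_client_api.py).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FLModel:
    """Stand-in for the model container exchanged with the FL server."""
    params: Dict[str, float]
    metrics: Dict[str, float] = field(default_factory=dict)


class StubFlare:
    """Hypothetical in-memory replacement for `nvflare.client`.

    It only mimics the call order (init -> receive -> send) so the sketch
    runs without an FL server; it is not NVFlare's implementation.
    """

    def __init__(self, global_params: Dict[str, float]):
        self._global = FLModel(params=dict(global_params))
        self.sent: List[FLModel] = []

    def init(self) -> None:
        pass

    def receive(self) -> FLModel:
        return self._global

    def send(self, model: FLModel) -> None:
        self.sent.append(model)


def evaluate(params: Dict[str, float]) -> float:
    # Step 5: one evaluation function, reusable for global and local models.
    # Toy metric: the closer "w" is to 1.0, the higher the score.
    return 1.0 - abs(1.0 - params["w"])


def local_train(params: Dict[str, float]) -> Dict[str, float]:
    # Toy "training": move the weight halfway towards 1.0.
    return {"w": params["w"] + 0.5 * (1.0 - params["w"])}


flare = StubFlare({"w": 0.0})        # step 1 would be: import nvflare.client as flare
flare.init()                         # step 2
input_model = flare.receive()        # step 3
params = dict(input_model.params)    # step 4 (net.load_state_dict(...) in PyTorch)
global_accuracy = evaluate(params)   # step 6: metric of the *received* model
new_params = local_train(params)
output_model = FLModel(              # step 7
    params=new_params,
    metrics={"accuracy": global_accuracy},
)
flare.send(output_model)             # step 8
```

Note that the metric attached to the outgoing model describes the received global model, not the freshly trained one; that is what allows the server side to compare global models across rounds.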

Optional: Change the data path to an absolute path and use ```prepare_data.sh``` to download data

The modified code can be found in [./codes/cifar10_client_api.py](./codes/cifar10_client_api.py)

After we modify our training script, we need to put it into a [job structure](https://nvflare.readthedocs.io/en/latest/real_world_fl/job.html) so that the NVFlare system knows how to deploy and run the job.

Please refer to [JOB CLI tutorial](../../tutorials/job_cli.ipynb) on how to generate a job easily from our existing job templates.

We choose the [sag_pt job template](../../../job_templates/sag_pt/) and run the following command:

```bash
nvflare config -jt ../../../job_templates/
nvflare job list_templates
nvflare job create -force -j ./jobs/client_api -w sag_pt -sd ./codes/ -s ./codes/cifar10_client_api.py
```

Note that we have already created the [client_api job folder](./jobs/client_api/)

Now that we have rewritten our code and created the [client_api job folder](./jobs/client_api/), we can run it using the NVFlare Simulator:

```bash
./prepare_data.sh
nvflare simulator -n 2 -t 2 ./jobs/client_api
```

Congratulations! You have finished an FL training!

## The Decorator use case

The example above shows how to convert non-structured DL code to FL.

Often, people have already organized their code into "train", "evaluate", and "test" methods so that it can be reused.
In that case, the NVFlare DL2FL decorator is the way to go.

To structure the code, we make the following changes to [./codes/cifar10_tutorial_clean.py](./codes/cifar10_tutorial_clean.py):

1. Wrap training logic into a train method
2. Wrap evaluation logic into an evaluate method
3. Call train method and evaluate method

The result is [cifar10_tutorial_structured.py](./codes/cifar10_tutorial_structured.py)

To use this structured code in FL, we make the following changes:

1. Import NVFlare Client API: ```import nvflare.client as flare```
2. Initialize NVFlare Client API: ```flare.init()```
3. Modify the train method:
    - Decorate with ```@flare.train```
    - Take an additional argument at the beginning (the received ```input_model```)
    - Load the received aggregated/global model weights into net: ```net.load_state_dict(input_model.params)```
    - Return an FLModel object
4. Add an ```fl_evaluate``` method:
    - Decorate with ```@flare.evaluate```
    - First argument is the input FLModel
    - Return the metric as a float
5. Call ```fl_evaluate``` method before training to get metrics on received aggregated/global model

Optional: Change the data path to an absolute path and use ```prepare_data.sh``` to download data

The modified code can be found in [./codes/cifar10_decorator.py](./codes/cifar10_decorator.py)

Then we can create the job and run it using simulator:

```bash
nvflare job create -force -j ./jobs/decorator -w sag_pt -sd ./codes/ -s ./codes/cifar10_decorator.py
./prepare_data.sh
nvflare simulator -n 2 -t 2 ./jobs/decorator
```

## Transform CIFAR10 lightning training code to FL with NVFLARE Client lightning integration API

If you are using the Lightning framework to write your training scripts, you can use our NVFlare Lightning client API to convert them to FL.

We start from a CIFAR10 Lightning code example: [./codes/cifar10_tutorial_lightning.py](./codes/cifar10_tutorial_lightning.py).
Notice that we wrap the [Net class](./codes/net.py) into a LightningModule: [LitNet class](./codes/lit_net.py)

To transform the existing code to FL training code, we made the following changes:

1. Import NVFlare Lightning Client API: ```import nvflare.client.lightning as flare```
2. Patch the lightning trainer ```flare.patch(trainer)```
3. Call the trainer.validate() method to evaluate the newly received aggregated/global model. The resulting evaluation metric will be used for best model selection

The modified code can be found in [./codes/cifar10_lightning.py](./codes/cifar10_lightning.py)

Then we can create the job using sag_pt template:

```bash
nvflare job create -force -j ./jobs/lightning -w sag_pt -sd ./codes/ -s ./codes/cifar10_lightning.py
```

We need to modify the "key_metric" in "config_fed_server.conf" from "accuracy" to "val_acc_epoch" (this name originates from the code [here](./codes/lit_net.py#L56)) which means the validation accuracy for that epoch:

```
{
  id = "model_selector"
  name = "IntimeModelSelector"
  args {
    key_metric = "val_acc_epoch"
  }
}
```
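
Why the metric name has to match can be seen from a small sketch of key-metric-based selection. The class `BestModelTracker` below is a hypothetical helper, not the `IntimeModelSelector` implementation; in particular, silently ignoring reports that lack the key metric is a choice made for this sketch, and the numeric values are made up for illustration.

```python
from typing import Dict, Optional


class BestModelTracker:
    """Remembers the round with the highest value of `key_metric`.

    Hypothetical helper for illustration only; not an NVFlare class.
    """

    def __init__(self, key_metric: str):
        self.key_metric = key_metric
        self.best_value: Optional[float] = None
        self.best_round: Optional[int] = None

    def update(self, round_num: int, metrics: Dict[str, float]) -> bool:
        """Return True if this round becomes the new best."""
        if self.key_metric not in metrics:
            # A report under a different metric name is never selected.
            return False
        value = metrics[self.key_metric]
        if self.best_value is None or value > self.best_value:
            self.best_value = value
            self.best_round = round_num
            return True
        return False


tracker = BestModelTracker(key_metric="val_acc_epoch")
tracker.update(0, {"val_acc_epoch": 0.41})
tracker.update(1, {"val_acc_epoch": 0.55})
tracker.update(2, {"val_acc_epoch": 0.52})
tracker.update(3, {"accuracy": 0.90})  # ignored: name does not match key_metric
print(tracker.best_round, tracker.best_value)  # 1 0.55
```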

And we modify the model architecture to use the LitNet class:

```
{
  id = "persistor"
  path = "nvflare.app_opt.pt.file_model_persistor.PTFileModelPersistor"
  args {
    model {
      path = "lit_net.LitNet"
    }
  }
}
```

Finally we run it using simulator:

```bash
./prepare_data.sh
nvflare simulator -n 2 -t 2 ./jobs/lightning
```