forked from NVIDIA/NVFlare
Commit: Merge branch 'admin_big_file' of github.com:yanchengnv/NVFlare into admin_big_file
Showing 189 changed files with 15,088 additions and 904 deletions.
examples/hello-world/hello-ccwf/jobs/numpy-swcse/app/config/config_fed_client.json (71 additions, 0 deletions)
```json
{
  "format_version": 2,
  "executors": [
    {
      "tasks": ["train", "submit_model", "validate"],
      "executor": {
        "path": "nvflare.app_common.ccwf.comps.np_trainer.NPTrainer",
        "args": {}
      }
    },
    {
      "tasks": ["swarm_*"],
      "executor": {
        "path": "nvflare.app_common.ccwf.SwarmClientController",
        "args": {
          "learn_task_name": "train",
          "learn_task_timeout": 5.0,
          "persistor_id": "persistor",
          "aggregator_id": "aggregator",
          "shareable_generator_id": "shareable_generator",
          "min_responses_required": 2,
          "wait_time_after_min_resps_received": 1
        }
      }
    },
    {
      "tasks": ["cse_*"],
      "executor": {
        "path": "nvflare.app_common.ccwf.CrossSiteEvalClientController",
        "args": {
          "submit_model_task_name": "submit_model",
          "validation_task_name": "validate",
          "persistor_id": "persistor"
        }
      }
    }
  ],
  "task_result_filters": [],
  "task_data_filters": [],
  "components": [
    {
      "id": "persistor",
      "path": "nvflare.app_common.ccwf.comps.np_file_model_persistor.NPFileModelPersistor",
      "args": {}
    },
    {
      "id": "shareable_generator",
      "path": "nvflare.app_common.ccwf.comps.simple_model_shareable_generator.SimpleModelShareableGenerator",
      "args": {}
    },
    {
      "id": "aggregator",
      "path": "nvflare.app_common.aggregators.intime_accumulate_model_aggregator.InTimeAccumulateWeightedAggregator",
      "args": {
        "expected_data_kind": "WEIGHT_DIFF"
      }
    },
    {
      "id": "model_selector",
      "path": "nvflare.app_common.ccwf.comps.simple_intime_model_selector.SimpleIntimeModelSelector",
      "args": {}
    },
    {
      "id": "result_printer",
      "path": "nvflare.app_common.ccwf.comps.cwe_result_printer.CWEResultPrinter",
      "args": {}
    }
  ]
}
```
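For intuition, the contract behind the ```aggregator``` component's ```expected_data_kind: "WEIGHT_DIFF"``` can be sketched in plain Python: each client returns a weight *difference* plus a sample count, and the aggregator applies the sample-weighted average of those diffs to the current global weights. This is only an illustrative sketch, not the actual ```InTimeAccumulateWeightedAggregator``` implementation, and the function and variable names are made up:

```python
# Illustrative sketch of WEIGHT_DIFF aggregation (not NVFlare internals):
# each client contributes (weight_diff, num_samples); the aggregator computes
# the sample-weighted average diff and applies it to the global weights.

def aggregate_weight_diffs(global_params, contributions):
    """contributions: list of (diff_dict, num_samples) tuples."""
    total = sum(n for _, n in contributions)
    new_params = {}
    for key, value in global_params.items():
        weighted_diff = sum(diff[key] * n for diff, n in contributions) / total
        new_params[key] = value + weighted_diff
    return new_params

# one parameter "w", two clients with different sample counts:
# weighted diff = (0.2*100 + (-0.1)*300) / 400 = -0.025, so w becomes 0.975
print(aggregate_weight_diffs({"w": 1.0}, [({"w": 0.2}, 100), ({"w": -0.1}, 300)]))
```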
examples/hello-world/hello-ccwf/jobs/numpy-swcse/app/config/config_fed_server.json (26 additions, 0 deletions)
```json
{
  "format_version": 2,
  "task_data_filters": [],
  "task_result_filters": [],
  "components": [
    {
      "id": "json_generator",
      "name": "ValidationJsonGenerator",
      "args": {}
    }
  ],
  "workflows": [
    {
      "id": "swarm_controller",
      "path": "nvflare.app_common.ccwf.SwarmServerController",
      "args": {
        "num_rounds": 3
      }
    },
    {
      "id": "cross_site_eval",
      "path": "nvflare.app_common.ccwf.CrossSiteEvalServerController",
      "args": {}
    }
  ]
}
```
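The ```num_rounds``` argument bounds how many train/aggregate cycles the workflow runs. A toy plain-Python analogue of such a round loop (purely illustrative; the real ```SwarmServerController``` orchestrates clients rather than computing updates itself, and all names here are made up):

```python
# Toy analogue of a multi-round FL workflow: each round, every client proposes
# an update for the current global value, the updates are averaged, and the
# global value moves by that average. num_rounds bounds the loop, as in the config.

def run_rounds(global_value, clients, num_rounds=3):
    for _ in range(num_rounds):
        updates = [client(global_value) for client in clients]
        global_value += sum(updates) / len(updates)
    return global_value

# two stub "clients" that each propose a fixed step of 0.1 and 0.3;
# the average step is 0.2 per round, so 3 rounds move 0.0 to about 0.6
print(run_rounds(0.0, [lambda v: 0.1, lambda v: 0.3], num_rounds=3))
```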
...world/ml-to-fl/jobs/client_api1/meta.json → ...rld/hello-ccwf/jobs/numpy-swcse/meta.json (1 addition, 1 deletion)
# Deep Learning to Federated Learning transition with NVFlare

Converting Deep Learning (DL) code to Federated Learning (FL) is not easy, as it involves:

1. Algorithm formulation: how to formulate a DL algorithm as an FL algorithm, and what information needs to be passed between the Client and the Server

2. Converting existing standalone, centralized DL code to FL code.

3. Configuring the workflow to use the newly changed code.

In this example, we assume the algorithm formulation is fixed (FedAvg).
We show how to quickly convert the centralized DL code to FL.
We will demonstrate different techniques depending on the existing code structure and preferences.

To configure the workflow, one can refer to the configs we have here and to the documentation.

In this directory, we provide job configurations that showcase how to utilize
`LauncherExecutor`, `Launcher`, and several NVFlare interfaces to simplify the
transition from your DL code to FL with NVFlare.

We will demonstrate how to transform existing DL code into an FL application step by step:

## Examples
1. Show a baseline training script: [the baseline](#the-base-line)
2. How to modify a non-structured script using the DL2FL Client API: [the Client API usage example](#transform-cifar10-dl-training-code-to-fl-including-best-model-selection-using-client-api)
3. How to modify a structured script using the DL2FL decorator: [the decorator usage example](#the-decorator-use-case)
4. How to modify a structured "lightning" script using the DL2FL Lightning Client API: [the lightning use case](#transform-cifar10-lightning-training-code-to-fl-with-nvflare-client-lightning-integration-api)
## The base line

We take a CIFAR10 example directly from the [PyTorch website](https://github.com/pytorch/tutorials/blob/main/beginner_source/blitz/cifar10_tutorial.py) and do the following cleanups to get [cifar10_tutorial_clean.py](./codes/cifar10_tutorial_clean.py):

1. Remove the comments
2. Move the definition of the Convolutional Neural Network to [net.py](./codes/net.py)
3. Wrap the whole code inside a main method (https://docs.python.org/3/library/multiprocessing.html#the-spawn-and-forkserver-start-methods)
4. Add the ability to run on GPU to speed up the training process (optional)

You can run the baseline using

```bash
python3 ./codes/cifar10_tutorial_clean.py
```

It will run for 2 epochs, and you will see something like:

```bash
Extracting ./data/cifar-10-python.tar.gz to ./data
Files already downloaded and verified
[1, 2000] loss: 2.127
[1, 4000] loss: 1.826
[1, 6000] loss: 1.667
[1, 8000] loss: 1.568
[1, 10000] loss: 1.503
[1, 12000] loss: 1.455
[2, 2000] loss: 1.386
[2, 4000] loss: 1.362
[2, 6000] loss: 1.348
[2, 8000] loss: 1.329
[2, 10000] loss: 1.327
[2, 12000] loss: 1.275
Finished Training
Accuracy of the network on the 10000 test images: 55 %
```
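The loss lines above come from the tutorial's running-loss bookkeeping, which can be sketched as follows (an illustrative helper written for this explanation, not code taken from the script):

```python
# Sketch of the tutorial's logging pattern: accumulate the loss and emit the
# running average every 2000 mini-batches, then reset the accumulator.

def running_loss_lines(epoch, batch_losses, every=2000):
    lines = []
    running_loss = 0.0
    for i, loss in enumerate(batch_losses):
        running_loss += loss
        if (i + 1) % every == 0:
            lines.append(f"[{epoch}, {i + 1}] loss: {running_loss / every:.3f}")
            running_loss = 0.0
    return lines

# 4000 constant mini-batch losses in epoch 1 produce two log lines
print(running_loss_lines(1, [1.5] * 4000))
# → ['[1, 2000] loss: 1.500', '[1, 4000] loss: 1.500']
```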

## Transform CIFAR10 DL training code to FL including best model selection using Client API

Now that we have CIFAR10 DL training code, let's transform it to FL with the NVFlare Client API.

We make the following changes:

1. Import the NVFlare Client API: ```import nvflare.client as flare```
2. Initialize the NVFlare Client API: ```flare.init()```
3. Receive the aggregated/global FLModel from the NVFlare side: ```input_model = flare.receive()```
4. Load the received aggregated/global model weights into the model structure: ```net.load_state_dict(input_model.params)```
5. Wrap the evaluation logic into a method, so it can be re-used to evaluate both the trained model and the received aggregated/global model
6. Evaluate the received aggregated/global model to get the metrics for model selection
7. Construct the FLModel to be returned to the NVFlare side: ```output_model = flare.FLModel(xxx)```
8. Send the model back to NVFlare: ```flare.send(output_model)```
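Put together, the modified script has roughly this shape. This is a pseudocode-style sketch of the steps above: the ```train```/```evaluate``` helpers and the outer loop are illustrative stand-ins, so see [./codes/cifar10_client_api.py](./codes/cifar10_client_api.py) for the actual code:

```python
import nvflare.client as flare                   # 1. import the Client API
from net import Net

net = Net()
flare.init()                                     # 2. initialize the Client API

while flare.is_running():                        # illustrative outer loop
    input_model = flare.receive()                # 3. receive the global FLModel
    net.load_state_dict(input_model.params)      # 4. load the global weights
    accuracy = evaluate(net)                     # 5./6. evaluate the global model
    train(net)                                   # local training, mostly unchanged
    output_model = flare.FLModel(                # 7. construct the FLModel to return
        params=net.cpu().state_dict(),
        metrics={"accuracy": accuracy},
    )
    flare.send(output_model)                     # 8. send the result back to NVFlare
```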

Optional: Change the data path to an absolute path and use ```prepare_data.sh``` to download the data

The modified code can be found in [./codes/cifar10_client_api.py](./codes/cifar10_client_api.py)

After we modify our training script, we need to put it into a [job structure](https://nvflare.readthedocs.io/en/latest/real_world_fl/job.html) so that the NVFlare system knows how to deploy and run the job.

Please refer to the [JOB CLI tutorial](../../tutorials/job_cli.ipynb) on how to generate a job easily from our existing job templates.

We choose the [sag_pt job template](../../../job_templates/sag_pt/) and run the following commands:

```bash
nvflare config -jt ../../../job_templates/
nvflare job list_templates
nvflare job create -force -j ./jobs/client_api -w sag_pt -sd ./codes/ -s ./codes/cifar10_client_api.py
```

Note that we have already created the [client_api job folder](./jobs/client_api/).

Now that we have rewritten our code and created the job folder, we can run it using the NVFlare Simulator:

```bash
./prepare_data.sh
nvflare simulator -n 2 -t 2 ./jobs/client_api
```

Congratulations! You have finished an FL training!
## The Decorator use case

The above case shows how you can change non-structured DL code to FL.

Usually, people have already put their code into "train", "evaluate", and "test" methods so it can be reused.
In that case, the NVFlare DL2FL decorator is the way to go.

To structure the code, we make the following changes to [./codes/cifar10_tutorial_clean.py](./codes/cifar10_tutorial_clean.py):

1. Wrap the training logic into a train method
2. Wrap the evaluation logic into an evaluate method
3. Call the train method and the evaluate method

The result is [cifar10_tutorial_structured.py](./codes/cifar10_tutorial_structured.py)

To modify this structured code for FL, we make the following changes:

1. Import the NVFlare Client API: ```import nvflare.client as flare```
2. Initialize the NVFlare Client API: ```flare.init()```
3. Modify the train method:
    - Decorate it with ```@flare.train```
    - Take an additional argument (the received FLModel) at the beginning of the argument list
    - Load the received aggregated/global model weights into net: ```net.load_state_dict(input_model.params)```
    - Return an FLModel object
4. Add an ```fl_evaluate``` method:
    - Decorate it with ```@flare.evaluate```
    - Its first argument is the input FLModel
    - Return the metric as a float number
5. Call the ```fl_evaluate``` method before training to get the metrics on the received aggregated/global model
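The decorated version then looks roughly like this. This is pseudocode: the method bodies are elided, and ```net```/```evaluate``` come from the surrounding script; see [./codes/cifar10_decorator.py](./codes/cifar10_decorator.py) for the actual code:

```python
import nvflare.client as flare

flare.init()

@flare.train
def train(input_model=None):                     # extra argument: the received FLModel
    net.load_state_dict(input_model.params)      # load the global weights
    # ... local training loop as before ...
    return flare.FLModel(params=net.cpu().state_dict())

@flare.evaluate
def fl_evaluate(input_model=None):               # first argument is the input FLModel
    net.load_state_dict(input_model.params)
    return evaluate(net)                         # return the metric as a float

fl_evaluate()                                    # metrics on the received global model
train()                                          # then run local training
```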

Optional: Change the data path to an absolute path and use ```prepare_data.sh``` to download the data

The modified code can be found in [./codes/cifar10_decorator.py](./codes/cifar10_decorator.py)

Then we can create the job and run it using the simulator:

```bash
nvflare job create -force -j ./jobs/decorator -w sag_pt -sd ./codes/ -s ./codes/cifar10_decorator.py
./prepare_data.sh
nvflare simulator -n 2 -t 2 ./jobs/decorator
```
## Transform CIFAR10 lightning training code to FL with NVFlare Client lightning integration API

If you are using the Lightning framework to write your training scripts, you can use the NVFlare Lightning Client API to convert them to FL.

Given a CIFAR10 Lightning code example: [./codes/cifar10_tutorial_lightning.py](./codes/cifar10_tutorial_lightning.py).
Notice that we wrap the [Net class](./codes/net.py) into a LightningModule: the [LitNet class](./codes/lit_net.py)

To transform the existing code to FL training code, we make the following changes:

1. Import the NVFlare Lightning Client API: ```import nvflare.client.lightning as flare```
2. Patch the Lightning trainer: ```flare.patch(trainer)```
3. Call ```trainer.validate()``` to evaluate the newly received aggregated/global model. The resulting evaluation metric will be used for best model selection
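In sketch form, the patched script looks roughly like this. The trainer setup and the ```train_loader```/```val_loader``` names are assumed for illustration; see [./codes/cifar10_lightning.py](./codes/cifar10_lightning.py) for the actual code:

```python
import nvflare.client.lightning as flare         # 1. import the Lightning Client API
from pytorch_lightning import Trainer
from lit_net import LitNet

model = LitNet()
trainer = Trainer(max_epochs=1)                  # assumed trainer setup
flare.patch(trainer)                             # 2. patch the trainer for FL

while flare.is_running():                        # illustrative outer loop
    trainer.validate(model, val_loader)          # 3. evaluate the received global model
    trainer.fit(model, train_loader, val_loader) # local training as before
```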

The modified code can be found in [./codes/cifar10_lightning.py](./codes/cifar10_lightning.py)

Then we can create the job using the sag_pt template:

```bash
nvflare job create -force -j ./jobs/lightning -w sag_pt -sd ./codes/ -s ./codes/cifar10_lightning.py
```

We need to modify the "key_metric" in "config_fed_server.conf" from "accuracy" to "val_acc_epoch" (this name originates from the code [here](./codes/lit_net.py#L56)), which means the validation accuracy for that epoch:

```
{
  id = "model_selector"
  name = "IntimeModelSelector"
  args {
    key_metric = "val_acc_epoch"
  }
}
```

And we modify the model architecture to use the LitNet class:

```
{
  id = "persistor"
  path = "nvflare.app_opt.pt.file_model_persistor.PTFileModelPersistor"
  args {
    model {
      path = "lit_net.LitNet"
    }
  }
}
```

Finally, we run it using the simulator:

```bash
./prepare_data.sh
nvflare simulator -n 2 -t 2 ./jobs/lightning
```