This toolkit can be used to run Distributed and Federated experiments. This project makes use of Pytorch Distributed (Data Parallel) (docs) as well as Kubernetes, KubeFlow (Pytorch-Operator) (docs) and Helm (docs) for deployment. The goal of this project is to launch Federated Learning nodes in a true distribution fashion, with simple deployments using proven technology.
This project builds on the work by Bart Cox, on the Federated Learning toolkit developed to run with Docker and Docker Compose (repo)
This project is tested with Ubuntu 20.04 and Arch Linux and Python {3.7, 3.8, 3.9}.
Pytorch Distributed works based on a world_size
and rank
s. The ranks should be between 0
and world_size-1
.
Generally, the process leading the learning process has rank 0
and the clients have ranks [1,..., world_size-1]
.
Currently, it is assumed that Distributed Learning is performed (and not Federated Learning), however, future
extension of the project is planned to implement a FederatedClient
that allows for a more realistic simulation of
Federated Learning experiments.
General protocol:
- Client creation and spawning by the Orchestrator (using KubeFlows Pytorch-Operator)
- Clients prepare needed data and model and synchronize using PyTorch Distributed.
WORLD_SIZE = 1
: Client performs training locally.WORLD_SIZE > 1
: Clients run epochs with DistributedDataParallel together.- (FUTURE: ) Your federated learning experiment.
- Client logs/reports progress during and after training.
Important notes:
- Data between clients (
WORLD_SIZE > 1
) is not shared - Hardware can heterogeneous
- The location of devices matters (network latency and bandwidth)
- Communication can be costly
When deploying the system, the following diagram shows how the system operates. PyTorchJob
s are launched by the
Orchestrator (see the Orchestrator charts). The Extractor keeps track of progress (see the
Extractor charts).
The PyTorchJob
s can consist on a variable number of machines, with different hardware for the Master/Leader node and the
Client nodes. KubeFlow (not depicted) orchestrates the deployment of the PyTorchJob
s.
It might be that something is missing, please open a pull request/issue).
Structure with important folders and files explained:
project
├── charts # Templates for deploying projects with Helm
│ ├── extractor - Template for 'extractor' for centralized logging (using NFS)
│ └── orchestrator - Template for 'orchestrator' for launching distributed experiments
├── configs # General configuration files
│ ├── quantities - Files for (Kubernetes) quantity conversion
│ └── tasks - Files for experiment description
├── data # Directory for default datasets (for a reduced load on hosting providers)
├── fltk # Source code
│ ├── datasets - Datasets (by default provide Pytorchs Distributed Data-Parallel)
│ ├── nets - Default models
│ ├── schedulers - Learningrate schedulers
│ ├── strategy - (Future) Basic strategies for Federated Learning experiments
│ └── util - Helper utilities
│ ├── cluster * Cluster interaction (Job Creation and monitoring)
│ ├── config * Configuration file loading
│ └── task * Arrival/TrainTask generation
└── logging # Default logging location
- Cifar10-CNN (CIFAR10CNN)
- Cifar10-ResNet
- Cifar100-ResNet
- Cifar100-VGG
- Fashion-MNIST-CNN
- Fashion-MNIST-ResNet
- Reddit-LSTM
- CIFAR10
- Cifar100
- Fashion-MNIST
- MNIST
The following tools need to be set up in your development environment before working with the (Kubernetes) FLTK.
- Hard requirements
- Local execution (single machine):
- MiniKube (docs)
- It must be noted that certain functionality might require additional steps to work on MiniKube. This is currently untested.
- MiniKube (docs)
- Google Cloud Environment (GKE) execution:
- GCloud SDK (docs)
- Your own cluster provider:
- A Kubernetes cluster supporting Kubernetes 1.16+.
Before continuing a deployment, first, the used datasets need to be downloaded. This is done to prevent the need for downloading each dataset for each container. Per default, these models are included in the Docker container that gets deployed on a Kubernetes Cluster.
To download the models, execute the following command from the project root.
python3 -m fltk extractor ./configs/example_cloud_experiment.json
This deployment guide will provide the general process of deploying an example deployment on the created cluster. It is assumed that you have already set up a cluster (or emulation tool like MiniKube to execute the commands locally).
N.B. This setup expects the NodePool on which you want to run training experiments, to have Taints, this should be set for the selected nodes. For more information on GKE see docs.
In this project we assume the following taint to be set, this can also be done using kubectl
for each node.
fltk.node=normal:NoSchedule
Programmatically, the following V1Toleration
allows pods to be scheduled on such 'tainted' nodes, regardless of the value
for fltk.node
.
from kubernetes.client import V1Toleration
V1Toleration(key="fltk.node",
operator="Exists",
effect="NoSchedule")
For a more strict Toleration (specific to a value), the following V1Toleration
should be generated.
V1Toleration(key="fltk.node",
operator="Equals",
value="normal",
effect='NoSchedule')
For more information on the programmatic creation of PyTorchJobs
to spawn on a cluster, refer to the
DeploymentBuilder
found in ./fltk/util/cluster/client.py
and the function
construct_job
.
Currently, this guide was tested to result in a working FLTK setup on GKE and MiniKube.
The guide is structured as follows:
- (Optional) Setup a Kubernetes Dashboard instance for monitoring
- Install KubeFlow's Pytorch-Operator (in a bare minimum configuration).
- KubeFlow is used to create and manage Training jobs for Pytorch Training jobs. However, you can also extend the work by making use of KubeFlows TF-Operator, to make use of Tensorflow.
- (Optional) Deploy KubeFlow PyTorch Job using an example project.
- Install an NFS server.
- To simplify FLTK's deployment, an NFS server is used to allow for the creation of
ReadWriteMany
volumes in Kubernetes. These volumes are, for example, used to create a centralized logging point, that allows for easy extraction of data from theExtractor
pod.
- To simplify FLTK's deployment, an NFS server is used to allow for the creation of
- Setup and install the
Extractor
pod.- The
Extractor
pod is used to create the required volume claims, as well as create a single access point to gain insight into the training process. Currently, it spawns a pod that runs the aTensorboard
instance, as aSummaryWriter
is used to record progress in aTensorboard
format. These are written to aReadWriteMany
mounted on a pods$WORKING_DIR/logging
by default during execution.
- The
- Deploy a default FLTK experiment.
Kubernetes Dashboard provides a comprehensive interface into some metrics, logs and status information of your cluster and the deployments it's running. To setup this dashboard, Helm can be used as follows:
helm repo add kubernetes-dashboard https://kubernetes.github.io/dashboard/
helm install kubernetes-dashboard kubernetes-dashboard/kubernetes-dashboard
After setup completes, running the following commands (in case you change the release name to something different, you can
fetch the command using helm status your-release-name --namespace optional-namespace-name
) to connect to your Kubernetes
Dashboard.
export POD_NAME=$(kubectl get pods -n default -l "app.kubernetes.io/name=kubernetes-dashboard,app.kubernetes.io/instance=kubernetes-dashboard" -o jsonpath="{.items[0].metadata.name}")
kubectl -n default port-forward $POD_NAME 8443:8443
Then browsing to https://localhost:8443 on your machine will connect you to the Dashboard instance. Note that the certificate is self-signed of the Kubernetes Dashboard, so your browser may give warnings that the site is unsafe.
Kubeflow is an ML toolkit that allows to for a wide range of distributed machine and deep learning operations on Kubernetes clusters. FLTK makes use of the 1.3 release. We will deploy a minimal configuration, following the documentation of KubeFlows manifests repository. If you have already setup KubeFlow (and PyTorch-Operator) you may want to skip this step.
git clone https://github.com/kubeflow/manifests.git --branch=v1.3-branch
cd manifests
You might want to read the README.md
file for more information. Using Kustomize, we will install the default configuration
files for each KubeFlow component that is needed for a minimal setup. If you have already worked with KubeFlow on GKE
you might want to follow the GKE deployment on the official KubeFlow documentation. This will, however, result in a slightly
higher strain on your cluster, as more components will be installed.
kustomize build common/cert-manager/cert-manager/base | kubectl apply -f -
# Wait before executing the following command, as
kustomize build common/cert-manager/kubeflow-issuer/base | kubectl apply -f -
kustomize build common/istio-1-9/istio-crds/base | kubectl apply -f -
kustomize build common/istio-1-9/istio-namespace/base | kubectl apply -f -
kustomize build common/istio-1-9/istio-install/base | kubectl apply -f -
kustomize build common/dex/overlays/istio | kubectl apply -f -
kustomize build common/oidc-authservice/base | kubectl apply -f -
kustomize build common/knative/knative-serving/base | kubectl apply -f -
kustomize build common/istio-1-9/cluster-local-gateway/base | kubectl apply -f -
kustomize build common/kubeflow-namespace/base | kubectl apply -f -
kustomize build common/kubeflow-roles/base | kubectl apply -f -
kustomize build common/istio-1-9/kubeflow-istio-resources/base | kubectl apply -f -
kustomize build apps/pytorch-job/upstream/overlays/kubeflow | kubectl apply -f -
In case you want to test your KubeFlow deployment, an example training job can be run. For this, an example project of the pytorch-operator repository can be used.
git checkout https://github.com/kubeflow/pytorch-operator.git
cd pytorch-operator/examples/mnist
Follow the README.md
instructions, and make sure to rename the image name in pytorch-operator/examples/mnist/v1/pytorch_job_mnist_gloo.yaml
(line 33 and 35), to your project on GCE. Also commend out the resource
descriptions in lines 20-22 and 36-38. Otherwise
jobs require GPU support to run.
Build and push the Docker container, and execute the command to launch your first PyTorchJob on your cluster.
kubectl create -f ./v1/pytorch_job_mnist_gloo.yaml
Create your namespace in your cluster, that will later be used to deploy experiments. This guide (and the default
setup of the project) assumes that the namespace test
is used. To create a namespace, run the following command with your cluster credentials set up before running these commands.
kubectl namespace create test
During the execution, ReadWriteMany
persistent volumes are needed. This is because each training processes master
pod uses aSummaryWriter
to log the training progress. As such, multiple containers on potentially different nodes require
read-write access to a single volume. One way to resolve this is to make use of Google Firestore (or
equivalent on your service provider of choice). However, this will incur significant operating costs, as operation starts at 1 TiB (~200 USD per month). As such, we will deploy our own a NFS on our cluster.
In case this does not need your scalability requirements, you may want to set up a (sharded) CouchDB instance, and use that as a data store. This is not provided in this guide.
For FLTK, we make use of the nfs-server-provisioner
Helm chart created by kvaps
, which neatly wraps this functionality in an easy
to deploy chart. Make sure to install the NFS server in the same namespace as where you want to run your experiments.
Running the following commands will deploy a nfs-server
instance (named nfs-server
) with the default configuration.
In addition, it creates a Persistent Volume of 20 Gi
, allowing for 20 Gi
ReadWriteMany
persistent volume claims.
You may want to change this amount, depending on your need. Other service providers, such as DigitalOcean, might require the
storageClass
to be set to do-block-storage
instead of default
.
helm repo add kvaps https://kvaps.github.io/charts
helm update
helm install nfs-server kvaps/nfs-server-provisioner --namespace test --set persistence.enabled=true,persistence.storageClass=standard,persistence.size=20Gi
To create a Persistent Volume (for a Persistent Volume Claim), the following syntax should be used, similar to the Persistent Volume description provided in ./charts/extractor/templates/fl-log-claim-persistentvolumeclaim.yaml. Which creates a Persistent Volume that uses the values provided in ./charts/fltk-values.yaml.
N.B. If you wish to use a Volume as both ReadWriteOnce and ReadOnlyMany, GCE does NOT provide this functionality You'll need to either create a ReadWriteMany Volume with read-only Claims, or ensure that the writer completes before the readers are spawned (and thus allowing for ReadWriteOnce to be allowed during deployment). For more information consult the Kubernetes and GKE Kubernetes
On your remote cluster, you need to have set up a docker registry. For example, Google provides the Google Container Registry
(GCR). In this example, we will make use of GCR, to push our container to a project test-bed-distml
under the tag fltk
.
This requires you to have enabled the GCR in your GCE project beforehand. Make sure that your docker installation supports
Docker Buildkit, or remove the DOCKER_BUILDKIT=1
part from the command before running (this might require additional changes
in the Dockerfile).
DOCKER_BUILDKIT=1 docker build . --tag gcr.io/test-bed-distml/fltk
docker push gcr.io/test-bed-distml/fltk
N.B. when running in Minikube, you can also set up a local registry. An example of how this can be quickly achieved can be found in this Medium post by Shashank Srivastava.
This section only needs to be run once, as this will set up the TensorBoard service, as well as create the Volumes needed
for the deployment of the Orchestrator
's chart. It does, however, require you to have pushed the docker container to a
registry that can be accessed from your Cluster.
N.B. that removing the Extractor
chart will result in the deletion of the Persistent Volumes once all Claims are
released. This will remove the data that is stored on these volumes. Make sure to copy the contents of these directories to your local file system before uninstalling the Extractor
Helm chart. The following commands deploy the Extractor
Helm chart, under the name extractor
in the test
namespace.
cd charts
helm install extractor -f values.yaml --namespace test
And wait for it to deploy. (Check with helm ls --namespace test
)
N.B. To download data from the Extrator
node (which mounts the logging director), the following kubectl
command can be used. This will download the data in the logging directory to your file system. Note that downloading
many small files is slow (as they will be compressed individually). The command assumes that the default name is used
fl-extractor
.
kubectl cp --namespace test fl-extractor:/opt/federation-lab/logging ./logging
We have now completed the setup of the project and can continue by running actual experiments. If no errors occur, this should. You may also skip this step and work on your code, but it might be good to test your deployment before running into trouble later.
cd charts
helm install flearner ./orchestrator --namespace test -f fltk-values.yaml
This will spawn an fl-server
Pod in the test
Namespace, which will spawn Pods (using V1PyTorchJobs
), that
run experiments. It will currently make use of the configs/example_cloud_experiment.json
default configuration. As described in the values file of the Orchestrator
s Helm chart
- Currently, there is no GPU support in the Docker containers.