Merge pull request #36 from IBM/version1.0.2
Version 1.0.2 of IBM Federated Learning library
Showing 11 changed files with 118 additions and 132 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,56 @@ | ||
# Enabling GPU training | ||
|
||
IBM federated learning supports training neural network models | ||
on GPUs at the party side to speed up the training process. | ||
|
||
## Environment setup | ||
Install the libraries required for GPU training. | ||
- For Keras and TensorFlow models, install the corresponding `tensorflow-gpu` package | ||
by following the [TensorFlow GPU tutorial](https://www.tensorflow.org/install/gpu). | ||
IBM FL currently requires `tensorflow==1.15.0`, so | ||
you will need to install `tensorflow-gpu==1.15.0` in your GPU environment. | ||
|
||
## IBM FL configuration | ||
Users can enable and specify the number of GPUs they want to use for training | ||
via the party's configuration file. | ||
Below is an example of the party's configuration file: | ||
```yaml | ||
aggregator: | ||
  ip: 127.0.0.1 | ||
  port: 5000 | ||
connection: | ||
  info: | ||
    ip: 127.0.0.1 | ||
    port: 8085 | ||
    tls_config: | ||
      enable: false | ||
  name: FlaskConnection | ||
  path: ibmfl.connection.flask_connection | ||
  sync: false | ||
data: | ||
  info: | ||
    npz_file: examples/data/mnist/random/data_party0.npz | ||
  name: MnistKerasDataHandler | ||
  path: ibmfl.util.data_handlers.mnist_keras_data_handler | ||
local_training: | ||
  name: LocalTrainingHandler | ||
  path: ibmfl.party.training.local_training_handler | ||
model: | ||
  name: KerasFLModel | ||
  path: ibmfl.model.keras_fl_model | ||
  spec: | ||
    model_definition: examples/configs/keras_classifier/compiled_keras.h5 | ||
    model_name: keras-cnn | ||
  info: | ||
    gpu: | ||
      num_gpus: 2 # enable Keras training with 2 GPUs | ||
protocol_handler: | ||
  name: PartyProtocolHandler | ||
  path: ibmfl.party.party_protocol_handler | ||
``` | ||
In the above example, the `gpu` section under the `info` section of `model` specifies | ||
the GPU settings for the party's local training. | ||
Users can change `num_gpus` to match the computing resources available to each party. | ||
|
||
If no `gpu` section is present in `info`, Keras/`tf.keras` training uses | ||
the default CPU environment or **only one GPU**, even if the party can access multiple GPUs. |
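
The lookup described above can be sketched in Python. This is a minimal illustration, not IBM FL's own code: `requested_num_gpus` is a hypothetical helper, and it assumes the party configuration has already been parsed into a plain dictionary with the same nesting as the example (`model` → `info` → `gpu` → `num_gpus`).

```python
from typing import Optional


def requested_num_gpus(party_config: dict) -> Optional[int]:
    """Return model.info.gpu.num_gpus, or None when no `gpu` section exists.

    Hypothetical helper for illustration only; it is not part of IBM FL.
    """
    model_info = party_config.get("model", {}).get("info") or {}
    gpu_section = model_info.get("gpu")
    if gpu_section is None:
        # No `gpu` section: training falls back to the default environment.
        return None
    num_gpus = gpu_section.get("num_gpus")
    # Reject missing, boolean, non-integer, or non-positive values.
    if isinstance(num_gpus, bool) or not isinstance(num_gpus, int) or num_gpus < 1:
        raise ValueError(f"num_gpus must be a positive integer, got {num_gpus!r}")
    return num_gpus


# Parsed form of the relevant part of the example configuration.
example_config = {
    "model": {
        "name": "KerasFLModel",
        "info": {"gpu": {"num_gpus": 2}},
    }
}

print(requested_num_gpus(example_config))
print(requested_num_gpus({"model": {"name": "KerasFLModel"}}))
```

Returning `None` rather than a default number keeps the "no `gpu` section" case distinct from an explicit request for one GPU.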
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,21 @@ | ||
# Quorum handling and ability to rejoin | ||
|
||
## Quorum handling | ||
IBM FL lets users specify a quorum percentage in the aggregator config file, which gives flexibility to parties that may experience connectivity failures. Given the total number of parties registered at a particular round, the quorum percentage defines the minimum number of parties that must reply for that round. If the aggregator receives fewer replies than that in some round, it stops the federated learning process. This ensures that if some parties drop out for any reason, they can rejoin as long as the number of available parties does not fall below the quorum value. | ||
|
||
For example, in the following configuration file `perc_quorum` is set to 0.75. This means that in each round the aggregator expects 75% of the registered parties to reply. So if 20 parties have registered, at least 15 must reply, and federated learning continues as long as no more than five parties drop out. | ||
|
||
```yaml | ||
hyperparams: | ||
  global: | ||
    max_timeout: 60 | ||
    num_parties: 5 | ||
    perc_quorum: 0.75 | ||
    rounds: 3 | ||
    termination_accuracy: 0.9 | ||
``` | ||
|
||
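
The quorum arithmetic above can be written out as a short Python sketch. The helper names `required_replies` and `quorum_met` are hypothetical and not part of IBM FL, and rounding a fractional threshold up to the next whole party is an assumption; the source does not say how IBM FL rounds.

```python
import math


def required_replies(registered_parties: int, perc_quorum: float = 1.0) -> int:
    """Minimum number of replies needed in a round (hypothetical helper).

    Assumption: a fractional threshold is rounded up to the next whole party.
    The default of 1.0 mirrors the documented 100% quorum when none is set.
    """
    if not 0.0 <= perc_quorum <= 1.0:
        raise ValueError("perc_quorum must be between 0 and 1")
    # round() guards against floating-point noise such as 0.7 * 10 = 7.000000000000001
    return math.ceil(round(perc_quorum * registered_parties, 9))


def quorum_met(replies: int, registered_parties: int, perc_quorum: float = 1.0) -> bool:
    """True when enough parties replied for the round to proceed."""
    return replies >= required_replies(registered_parties, perc_quorum)


# The example from the text: 20 registered parties, perc_quorum of 0.75.
print(required_replies(20, 0.75))  # 15
print(quorum_met(15, 20, 0.75))    # True: five parties dropped out
print(quorum_met(14, 20, 0.75))    # False: six parties dropped out
```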
## Maximum timeout and rejoin | ||
In the aggregator configuration file, users can specify the maximum time (in seconds) that the aggregator should wait for parties to reply. If a `max_timeout` value is specified, the aggregator waits for the specified amount of time to check whether the required number of parties (calculated from the quorum percentage described earlier) have replied. Note that if the quorum percentage is not specified, the aggregator assumes a value of 100% and expects a reply from every registered party. Similarly, if the maximum timeout is not specified, the aggregator waits indefinitely for parties to reply. | ||
|
||
To rejoin, a party just needs to issue the START and REGISTER commands, as it did when it first joined the federated learning process. |
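
The timeout behaviour described above can be sketched as a polling loop. This is an illustration under stated assumptions, not IBM FL's implementation: `wait_for_replies`, `FakeParties`, and `poll_interval` are hypothetical names, and the sketch assumes the aggregator may stop waiting as soon as enough replies have arrived. A `max_timeout` of `None` models the documented "wait forever" default.

```python
import time
from typing import Callable, Optional


def wait_for_replies(count_replies: Callable[[], int],
                     required: int,
                     max_timeout: Optional[float] = None,
                     poll_interval: float = 0.01) -> bool:
    """Poll until `required` replies arrive or `max_timeout` seconds pass.

    Returns True if the quorum was reached, False on timeout.
    With max_timeout=None the loop waits indefinitely.
    """
    deadline = None if max_timeout is None else time.monotonic() + max_timeout
    while True:
        if count_replies() >= required:
            return True
        if deadline is not None and time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)


class FakeParties:
    """Stand-in for real parties: one more party has replied on each poll."""

    def __init__(self, total: int) -> None:
        self.total = total
        self.replied = 0

    def poll(self) -> int:
        self.replied = min(self.replied + 1, self.total)
        return self.replied


# Four parties are needed and four eventually reply: the quorum is reached.
print(wait_for_replies(FakeParties(4).poll, required=4, max_timeout=1.0))
# Only three parties ever reply: the wait times out.
print(wait_for_replies(FakeParties(3).poll, required=4, max_timeout=0.05))
```

`time.monotonic()` is used for the deadline so that system clock adjustments do not shorten or lengthen the wait.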
This file was deleted.
This file was deleted.
Binary file removed: `federated-learning-lib/federated_learning_lib-1.0.1-py3-none-any.whl` (-154 KB)