Merge remote-tracking branch 'origin/master' into averager-libp2p
borzunov committed Jul 16, 2021
2 parents 2e51140 + 4a006ae commit 12e8039
Showing 16 changed files with 605 additions and 233 deletions.
110 changes: 86 additions & 24 deletions README.md
@@ -1,36 +1,38 @@
## Hivemind: decentralized deep learning in PyTorch

[![CI status](https://github.com/learning-at-home/hivemind/actions/workflows/run-tests.yml/badge.svg?branch=master)](https://github.com/learning-at-home/hivemind/actions)
[![Documentation Status](https://readthedocs.org/projects/learning-at-home/badge/?version=latest)](https://learning-at-home.readthedocs.io/en/latest/?badge=latest)
[![Gitter](https://badges.gitter.im/learning-at-home/hivemind.svg)](https://gitter.im/learning-at-home/hivemind?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge)
[![PyPI version](https://img.shields.io/pypi/v/hivemind.svg)](https://pypi.org/project/hivemind/)
[![Discord](https://img.shields.io/static/v1?style=default&label=Discord&logo=discord&message=join)](https://discord.gg/xC7ucM8j)
[![CI status](https://github.com/learning-at-home/hivemind/actions/workflows/run-tests.yml/badge.svg?branch=master)](https://github.com/learning-at-home/hivemind/actions)
![Codecov](https://img.shields.io/codecov/c/github/learning-at-home/hivemind)
[![Black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

Hivemind is a PyTorch library to train large neural networks across the Internet. Its intended usage is training a
single Transformer model on hundreds of computers from different universities, companies, and volunteers.
Hivemind is a PyTorch library for decentralized deep learning across the Internet. Its intended usage is training one
large model on hundreds of computers from different universities, companies, and volunteers.

![img](https://i.imgur.com/GPxolxb.gif)

## Key Features

* Train neural networks of arbitrary size: parts of their layers are distributed across the participants.
* Distributed training without a master node: Distributed Hash Table allows connecting computers in a decentralized
network.
* Fault-tolerant backpropagation: forward and backward passes succeed even if some nodes are unresponsive or take too
long to respond.
* Decentralized parameter averaging: iteratively aggregate updates from multiple workers without the need to
synchronize across the entire network.
* Decentralized parameter averaging: iteratively aggregate updates from multiple
workers without the need to synchronize across the entire network ([paper](https://arxiv.org/abs/2103.03239)).
* Train neural networks of arbitrary size: parts of their layers are distributed across the participants with the
decentralized mixture-of-experts ([paper](https://arxiv.org/abs/2002.04013)).

To learn more about the ideas behind this library, see https://learning-at-home.github.io or read
the [NeurIPS 2020 paper](https://arxiv.org/abs/2002.04013).

## Installation

Before installing hivemind, make sure that your environment has Python 3.7+
and [PyTorch](https://pytorch.org/get-started/locally/#start-locally) with a version at least as new as 1.6.0.
Before installing, make sure that your environment has Python 3.7+
and [PyTorch](https://pytorch.org/get-started/locally/#start-locally) 1.6.0 or newer.
You can install them either natively or with [Anaconda](https://www.anaconda.com/products/individual).

To start using this library, you can either use the pip package manager or build it from source. Since currently the
release cycle is not established yet, we recommend installing hivemind from source to keep up with the latest bugfixes
and improvements.
You can install [the latest release](https://pypi.org/project/hivemind) with pip or build hivemind from source.

### With pip

@@ -42,7 +44,7 @@ pip install hivemind

### From source

To install hivemind from source, simply clone the repository and install
To install hivemind from source, simply run the following:

```
git clone https://github.com/learning-at-home/hivemind.git
@@ -53,11 +55,31 @@ pip install .
If you would like to verify that your installation is working properly, you can install with `pip install -e .[dev]`
instead. Then, you can run the tests with `pytest tests/`.

By default, hivemind uses the precompiled binary of
the [go-libp2p-daemon](https://github.com/learning-at-home/go-libp2p-daemon) library. If you face compatibility issues
or want to build the binary yourself, you can recompile it by running `pip install . --global-option="--buildgo"`.
Before running the compilation, please ensure that your machine has a recent version
of [Go toolchain](https://golang.org/doc/install) (1.15 or higher).

### System requirements
- __Linux__ is the default OS for which hivemind is developed and tested. We recommend Ubuntu 18.04+ (64-bit),
but other 64-bit distros should work as well. Legacy 32-bit is not recommended.
- __macOS 10.x__ mostly works but requires building hivemind from source, and some edge cases may fail.
To ensure full compatibility, we recommend using [our Docker image](https://hub.docker.com/r/learningathome/hivemind).
- __Windows 10+ (experimental)__ can run hivemind using [WSL](https://docs.microsoft.com/ru-ru/windows/wsl/install-win10).
You can configure WSL to use GPU following [this guide](https://docs.nvidia.com/cuda/wsl-user-guide/index.html) by NVIDIA.
After the CUDA toolkit is installed you can simply follow the instructions above to install with pip or from source.

## Documentation

* [Quickstart](https://learning-at-home.readthedocs.io/en/latest/user/quickstart.html): install hivemind, set up a
server and train experts
* Documentation & guides are available at [learning-at-home.readthedocs.io](https://learning-at-home.readthedocs.io)
* The [quickstart tutorial](https://learning-at-home.readthedocs.io/en/latest/user/quickstart.html) walks through installation
and training a simple neural network with several peers.
* [examples/albert](https://github.com/learning-at-home/hivemind/tree/master/examples/albert) contains the starter kit
and instructions for training a Transformer masked language model collaboratively.
* API reference and additional tutorials are available at [learning-at-home.readthedocs.io](https://learning-at-home.readthedocs.io)

If you have any questions about installing and using hivemind, you can ask them in
[our Discord chat](https://discord.gg/xC7ucM8j) or file an [issue](https://github.com/learning-at-home/hivemind/issues).

## Contributing

@@ -66,7 +88,7 @@ documentation improvements to entirely new features, is equally appreciated.

If you want to contribute to hivemind but don't know where to start, take a look at the
unresolved [issues](https://github.com/learning-at-home/hivemind/issues). Open a new issue or
join [our chat room](https://gitter.im/learning-at-home/hivemind) in case you want to discuss new functionality or
join [our chat room](https://discord.gg/xC7ucM8j) in case you want to discuss new functionality or
report a possible bug. Bug fixes are always welcome, but new features should be preferably discussed with maintainers
beforehand.

@@ -77,7 +99,7 @@ our [guide](https://learning-at-home.readthedocs.io/en/latest/user/contributing.

## Citation

If you found hivemind or its underlying algorithms useful for your experiments, please cite the following source:
If you found hivemind or its underlying algorithms useful for your research, please cite the relevant papers:

```
@misc{hivemind,
@@ -88,7 +110,8 @@ If you found hivemind or its underlying algorithms useful for your experiments,
}
```

Also, you can cite [the paper](https://arxiv.org/abs/2002.04013) that inspired the creation of this library:
Also, you can cite [the paper](https://arxiv.org/abs/2002.04013) that inspired the creation of this library
(prototype implementation of hivemind available at [mryab/learning-at-home](https://github.com/mryab/learning-at-home)):

```
@inproceedings{ryabinin2020crowdsourced,
@@ -104,10 +127,49 @@ Also, you can cite [the paper](https://arxiv.org/abs/2002.04013) that inspired t
}
```

The initial implementation of hivemind used for the paper is available
at [mryab/learning-at-home](https://github.com/mryab/learning-at-home).
<details>
<summary>Additional publications</summary>

["Moshpit SGD: Communication-Efficient Decentralized Training on Heterogeneous Unreliable Devices"](https://arxiv.org/abs/2103.03239)

```
@misc{ryabinin2021moshpit,
title={Moshpit SGD: Communication-Efficient Decentralized Training on Heterogeneous Unreliable Devices},
author={Max Ryabinin and Eduard Gorbunov and Vsevolod Plokhotnyuk and Gennady Pekhimenko},
year={2021},
eprint={2103.03239},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
```

["Distributed Deep Learning in Open Collaborations"](https://arxiv.org/abs/2106.10207)

```
@misc{diskin2021distributed,
title={Distributed Deep Learning in Open Collaborations},
author={Michael Diskin and Alexey Bukhtiyarov and Max Ryabinin and Lucile Saulnier and Quentin Lhoest and Anton Sinitsin and Dmitry Popov and Dmitry Pyrkin and Maxim Kashirin and Alexander Borzunov and Albert Villanova del Moral and Denis Mazur and Ilia Kobelev and Yacine Jernite and Thomas Wolf and Gennady Pekhimenko},
year={2021},
eprint={2106.10207},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
```

["Secure Distributed Training at Scale"](https://arxiv.org/abs/2106.11257)

```
@misc{gorbunov2021secure,
title={Secure Distributed Training at Scale},
author={Eduard Gorbunov and Alexander Borzunov and Michael Diskin and Max Ryabinin},
year={2021},
eprint={2106.11257},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
```

In the documentation, we list
several [related](https://learning-at-home.readthedocs.io/en/latest/user/acknowledgements.html) projects and
acknowledgements.
</details>

We also maintain a list of [related projects and
acknowledgements](https://learning-at-home.readthedocs.io/en/latest/user/acknowledgements.html).
2 changes: 2 additions & 0 deletions docs/index.rst
@@ -21,6 +21,8 @@ documentation below.

user/quickstart
modules/index
user/dht
user/moe
user/contributing
user/benchmarks
user/acknowledgements
20 changes: 10 additions & 10 deletions docs/user/acknowledgements.md
@@ -1,6 +1,6 @@
# Credits
# Acknowledgements

We kindly thank (in random order)
We kindly thank (in no particular order)

* [Artem Babenko](https://research.yandex.com/people/102794) and
[Vladimir Aliev](https://ru.linkedin.com/in/vladimir-aliev-19b93282) for helpful discussions and editorial review of
@@ -14,15 +14,15 @@ We kindly thank (in random order)
* [Brian Muller](https://github.com/bmuller/kademlia) for his implementations
of [kademlia](https://github.com/bmuller/kademlia) and [rpcudp](https://github.com/bmuller/rpcudp)
* Alexander Sherbakov for helpful discussions on PC and server component architecture,
* Our early adopters, [contributors](https://github.com/learning-at-home/hivemind/graphs/contributors), and reviewers
* [Yandex School of Data Analysis](https://yandexdataschool.com) students, for helping us run first truly collaborative experiments.
* The [Neuropark community](https://neuropark.co/), for hosting early collaborative training experiments of sahajBERT with hivemind.
* Our early adopters, [contributors](https://github.com/learning-at-home/hivemind/graphs/contributors), and conference reviewers.

# Related projects

We also want to reference several projects that have similar ideas in mind:

In this section, we list several organizations and research projects that bring humanity closer to the dream of world-scale deep learning with volunteer computing.
* [Hugging Face](https://huggingface.co) — an AI community with world-leading NLP research that builds collaborative hub training using hivemind.
* [EYDLE](https://www.eydle.com) — a start-up that works towards distributed deep learning on volunteer hardware using centralized infrastructure.
* [BitTensor](https://github.com/opentensor/BitTensor) — a decentralized deep learning ecosystem with incentive
mechanism. Like hivemind, but peers are getting rewarded for their contribution to other peers. .
* [GShard](https://arxiv.org/abs/2006.16668) — a paper by Dmitry Lepikhin et al. that demonstrate the effectiveness of
huge Mixture-of-Experts models on conventional hpc hardware. Those guys train models 4 times the size of GPT-3 on
thousands of TPUv3.
* Also doing research in decentralized deep learning? Let us know!
mechanism. Each peer trains for its own objective and rewards others for useful features.
* Also building collaborative deep learning? Let us know! `hivemind-team <at> hotmail.com`
135 changes: 135 additions & 0 deletions docs/user/dht.md
@@ -0,0 +1,135 @@
# Hivemind DHT

In order to coordinate, hivemind peers form a Distributed Hash Table: a distributed "dictionary" where each peer
can store and get values. To initialize the first DHT node, run:

```python
from hivemind import DHT, get_dht_time

dht = DHT(start=True)
# create the first DHT node that listens for incoming connections from localhost only

print("For incoming connections, use:", dht.get_visible_maddrs())
```

You can now start more peers that connect to an existing DHT node using its listen address:
```python
dht2 = DHT(initial_peers=dht.get_visible_maddrs(), start=True)
```

Note that `initial_peers` contains the address of the first DHT node.
This implies that the resulting node will share the key-value storage with the first node, __as well as any other
nodes connected to it.__ When the two nodes are connected, subsequent peers can use any one of them (or both)
as `initial_peers` to connect to the shared "dictionary".

### Store/get operations

Once the DHT is formed, all participants can `dht.store` key-value pairs in the DHT and `dht.get` them by key:

```python
# first node: store a key-value pair for 600 seconds
store_ok = dht.store('my_key', ('i', 'love', 'bees'),
expiration_time=get_dht_time() + 600)

# second node: get the value stored by the first node
value, expiration = dht2.get('my_key', latest=True)
assert value == ('i', 'love', 'bees')
```

As you can see, each value in a hivemind DHT is associated with an expiration time,
computed as the current `get_dht_time()` plus some offset.
This expiration time is used to clean up old data and resolve write conflicts:
DHT nodes always prefer values with higher expiration time and may delete any value past its expiration.
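
To make the conflict rule concrete, here is a minimal single-process sketch of expiration-based storage. It is a toy model written for illustration, not hivemind's actual implementation: the class name `ToyExpiringStore`, the `now` parameter, and the choice to keep the existing value when two expiration times are equal are all assumptions.

```python
import time


class ToyExpiringStore:
    """A toy, in-memory model of a store where every value carries an expiration time."""

    def __init__(self):
        self._data = {}  # key -> (value, expiration_time)

    def store(self, key, value, expiration_time, now=None):
        now = time.time() if now is None else now
        if expiration_time <= now:
            return False  # refuse to store a value that has already expired
        current = self._data.get(key)
        if current is not None and current[1] >= expiration_time:
            return False  # the existing value expires no earlier, so it wins
        self._data[key] = (value, expiration_time)
        return True

    def get(self, key, now=None):
        now = time.time() if now is None else now
        entry = self._data.get(key)
        if entry is None or entry[1] <= now:
            self._data.pop(key, None)  # drop the value once it is past its expiration
            return None
        return entry
```

With this model, a write with a later expiration time replaces the stored value, a write with an earlier one is rejected, and a read after the expiration time finds nothing.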

### Values with subkeys

Hivemind DHT also supports a special value type that is itself a dictionary. When nodes store such a value,
they add sub-keys to the dictionary instead of overwriting it.

Consider an example where three DHT nodes want to find out who is going to attend the party:

```python
alice_dht = DHT(initial_peers=dht.get_visible_maddrs(), start=True)
bob_dht = DHT(initial_peers=dht2.get_visible_maddrs(), start=True)
carol_dht = DHT(initial_peers=alice_dht.get_visible_maddrs(), start=True)


# first, each peer stores a subkey for the same key
alice_dht.store('party', subkey='alice', value='yes', expiration_time=get_dht_time() + 600)
bob_dht.store('party', subkey='bob', value='yes', expiration_time=get_dht_time() + 600)
carol_dht.store('party', subkey='carol', value='no', expiration_time=get_dht_time() + 600)

# then, any peer can get the full list of attendees
attendees, expiration = alice_dht.get('party', latest=True)
print(attendees)
# {'alice': ValueWithExpiration(value='yes', expiration_time=1625504352.2668974),
# 'bob': ValueWithExpiration(value='yes', expiration_time=1625504352.2884178),
# 'carol': ValueWithExpiration(value='no', expiration_time=1625504352.3046832)}
```
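
The merging behaviour can be pictured with a small plain-Python model. This is only a sketch of the idea described above, not hivemind code: `Entry` and `store_subkey` are hypothetical names, and the rule that each subkey is resolved by its own expiration time is an assumption carried over from the conflict rule for regular values.

```python
from typing import Any, Dict, NamedTuple


class Entry(NamedTuple):
    value: Any
    expiration_time: float


def store_subkey(table: Dict[str, Dict[str, Entry]], key: str, subkey: str,
                 value: Any, expiration_time: float) -> bool:
    """Add or update one subkey without touching the other subkeys under the same key."""
    entries = table.setdefault(key, {})
    previous = entries.get(subkey)
    if previous is None or expiration_time > previous.expiration_time:
        entries[subkey] = Entry(value, expiration_time)
        return True
    return False  # an entry with the same or a later expiration is already stored


table: Dict[str, Dict[str, Entry]] = {}
store_subkey(table, 'party', 'alice', 'yes', expiration_time=600.0)
store_subkey(table, 'party', 'bob', 'yes', expiration_time=600.0)
store_subkey(table, 'party', 'carol', 'no', expiration_time=600.0)
```

Each call touches only its own subkey, which is why three peers writing to the same key end up with three entries rather than one.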

When training over the Internet, some `dht.get/store` requests may run for hundreds of milliseconds or even seconds.
To minimize the wait time, you can call these requests asynchronously via
[`dht.store/get/run_coroutine(..., return_future=True)`](https://learning-at-home.readthedocs.io/en/latest/modules/dht.html#hivemind.dht.DHT.get).
This will run the corresponding command in the background and return a [Future-like](https://docs.python.org/3/library/concurrent.futures.html) object that can be awaited.
Please also note that the returned future is compatible with asyncio (i.e. can be awaited inside the event loop).
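
The benefit of futures is that several slow requests can be in flight at once. The sketch below shows the general pattern with the standard library only, so it runs without a DHT: `slow_lookup` is a hypothetical stand-in for a network request, and `fetch_many` is an illustrative helper, not part of hivemind. With hivemind, the futures would instead come from calls made with `return_future=True`.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_lookup(key):
    """Hypothetical stand-in for a DHT request that takes a while over the network."""
    time.sleep(0.05)
    return f"value-for-{key}"


def fetch_many(keys):
    """Start all lookups first, then collect the results as they complete."""
    with ThreadPoolExecutor(max_workers=max(1, len(keys))) as pool:
        # submit() returns immediately with a Future, so all requests run concurrently
        futures = {key: pool.submit(slow_lookup, key) for key in keys}
        # result() blocks only until that particular request has finished
        return {key: future.result() for key, future in futures.items()}


results = fetch_many(["alpha", "beta", "gamma"])
```

Because every request is submitted before any result is awaited, the total wait is close to the slowest single request rather than the sum of all of them.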

For more details on DHT store/get and expiration time, please refer to the [documentation for DHT and DHTNode](https://learning-at-home.readthedocs.io/en/latest/modules/dht.html#dht-and-dhtnode).

### Running across the Internet

By default, DHT nodes are only accessible from your localhost. In order to run with multiple geographically
distributed computers, one must connect the DHT to a global network. Currently, there are two ways to achieve this.

The recommended approach is to grow the network from one or several initial peers. These can be any computers with a
public IP address that are always online. Each of these peers should simply create `hivemind.DHT` and set it to
accept incoming connections from the internet:

```python
import hivemind
dht = hivemind.DHT(
host_maddrs=["/ip4/0.0.0.0/tcp/0", "/ip4/0.0.0.0/udp/0/quic"],
start=True)

print('\n'.join(str(addr) for addr in dht.get_visible_maddrs()))
print("Global IP:", hivemind.utils.networking.choose_ip_address(dht.get_visible_maddrs()))
```

Running this code will print several strings (typically 4 or 6) of the following form (example):
```shell
/ip4/185.185.123.124/tcp/40615/p2p/QmaVTB2LwayToK2rzMkaCbkCaH7nF2rTHIS0IS0AN0EXAMPLE
/ip4/127.0.0.1/tcp/40615/p2p/QmaVTB2LwayToK2rzMkaCbkCaH7nF2rTHIS0IS0AN0EXAMPLE
/ip4/185.185.123.124/udp/40346/quic/p2p/QmaVTB2LwayToK2rzMkaCbkCaH7nF2rTHIS0IS0AN0EXAMPLE
/ip4/127.0.0.1/udp/40346/quic/p2p/QmaVTB2LwayToK2rzMkaCbkCaH7nF2rTHIS0IS0AN0EXAMPLE
Global IP: 185.185.123.124
```
These lines contain addresses that other nodes can use to connect to the network:
- `127.0.0.1` or `192.168.X.Y` are only accessible from your computer or local network, respectively.
- The remaining address is __global__ (`185.185.123.124` in the example, yours will be different).
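
If you want to pick out the global addresses programmatically, one possible approach is to inspect the IP component of each address with Python's standard `ipaddress` module. The helper below is an illustrative sketch, not part of hivemind: the name `public_maddrs` is an assumption, and it handles only addresses that begin with an `/ip4/` or `/ip6/` component.

```python
import ipaddress


def public_maddrs(maddrs):
    """Keep only the addresses whose IP component is globally routable."""
    result = []
    for maddr in maddrs:
        # "/ip4/<address>/tcp/<port>/p2p/<peer id>" splits into ['', 'ip4', '<address>', ...]
        parts = str(maddr).split("/")
        if len(parts) > 2 and parts[1] in ("ip4", "ip6"):
            if ipaddress.ip_address(parts[2]).is_global:
                result.append(str(maddr))
    return result
```

Loopback addresses such as `127.0.0.1` and private ranges such as `192.168.X.Y` are filtered out, which leaves the addresses suitable for sharing with remote peers.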

To connect a new peer to the network, you should specify `initial_peers` as the addresses that
correspond to the public IP:

```python
import hivemind
dht = hivemind.DHT(
host_maddrs=["/ip4/0.0.0.0/tcp/0", "/ip4/0.0.0.0/udp/0/quic"],
initial_peers=[
"/ip4/185.185.123.124/tcp/40615/p2p/QmaVTB2LwayToK2rzMkaCbkCaH7nF2rTHIS0IS0AN0EXAMPLE",
"/ip4/185.185.123.124/udp/40346/quic/p2p/QmaVTB2LwayToK2rzMkaCbkCaH7nF2rTHIS0IS0AN0EXAMPLE",
], start=True)
```

That's it, now the two DHT nodes are connected. If you connect additional peers to the network, you only need to specify
one (or a subset) of peers as `initial_peers`.
In case your peer operates behind a restrictive firewall, you may find it beneficial to set `client_mode=True`. In this
case, the DHT instance will access others, but it will not announce that other peers can connect to it.

Another (experimental) way is to use [IPFS](https://ipfs.io/): a global decentralized network for file storage.
We are not storing any files here: instead, we can use IPFS nodes to help hivemind peers find each other.
To use this strategy, set `use_ipfs=True` in each DHT node you create. This allows you to connect multiple DHT nodes even if
all of them are behind NAT. However, this strategy may be unreliable and depend heavily on the availability of public
IPFS nodes.

To learn more about the network address format, read [libp2p addressing](https://docs.libp2p.io/concepts/addressing/).
For an example of how to set up DHT in a distributed training experiment, see
[examples/albert](https://github.com/learning-at-home/hivemind/tree/master/examples/albert).