Merge remote-tracking branch 'origin/master' into averager-libp2p

learning-at-home · Jul 16, 2021 · 12e8039
2 parents 2e51140 + 4a006ae
commit 12e8039

Showing 16 changed files with 605 additions and 233 deletions.
diff --git a/README.md b/README.md
@@ -1,36 +1,38 @@
 ## Hivemind: decentralized deep learning in PyTorch
 
-[![CI status](https://github.com/learning-at-home/hivemind/actions/workflows/run-tests.yml/badge.svg?branch=master)](https://github.com/learning-at-home/hivemind/actions)
 [![Documentation Status](https://readthedocs.org/projects/learning-at-home/badge/?version=latest)](https://learning-at-home.readthedocs.io/en/latest/?badge=latest)
-[![Gitter](https://badges.gitter.im/learning-at-home/hivemind.svg)](https://gitter.im/learning-at-home/hivemind?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge)
+[![PyPI version](https://img.shields.io/pypi/v/hivemind.svg)](https://pypi.org/project/hivemind/)
+[![Discord](https://img.shields.io/static/v1?style=default&label=Discord&logo=discord&message=join)](https://discord.gg/xC7ucM8j)
+[![CI status](https://github.com/learning-at-home/hivemind/actions/workflows/run-tests.yml/badge.svg?branch=master)](https://github.com/learning-at-home/hivemind/actions)
+![Codecov](https://img.shields.io/codecov/c/github/learning-at-home/hivemind)
 [![Black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
 
-Hivemind is a PyTorch library to train large neural networks across the Internet. Its intended usage is training a
-single Transformer model on hundreds of computers from different universities, companies, and volunteers.
+Hivemind is a PyTorch library for decentralized deep learning across the Internet. Its intended usage is training one
+large model on hundreds of computers from different universities, companies, and volunteers.
 
 ![img](https://i.imgur.com/GPxolxb.gif)
 
 ## Key Features
 
-* Train neural networks of arbitrary size: parts of their layers are distributed across the participants.
 * Distributed training without a master node: Distributed Hash Table allows connecting computers in a decentralized
   network.
 * Fault-tolerant backpropagation: forward and backward passes succeed even if some nodes are unresponsive or take too
   long to respond.
-* Decentralized parameter averaging: iteratively aggregate updates from multiple workers without the need to
-  synchronize across the entire network.
+* Decentralized parameter averaging: iteratively aggregate updates from multiple
+  workers without the need to synchronize across the entire network ([paper](https://arxiv.org/abs/2103.03239)).
+* Train neural networks of arbitrary size: parts of their layers are distributed across the participants with the
+  decentralized mixture-of-experts ([paper](https://arxiv.org/abs/2002.04013)).
 
 To learn more about the ideas behind this library, see https://learning-at-home.github.io or read
 the [NeurIPS 2020 paper](https://arxiv.org/abs/2002.04013).
 
 ## Installation
 
-Before installing hivemind, make sure that your environment has Python 3.7+
-and [PyTorch](https://pytorch.org/get-started/locally/#start-locally) with a version at least as new as 1.6.0.
+Before installing, make sure that your environment has Python 3.7+ 
+and [PyTorch](https://pytorch.org/get-started/locally/#start-locally) 1.6.0 or newer.
+You can install them either natively or with [Anaconda](https://www.anaconda.com/products/individual).
 
-To start using this library, you can either use the pip package manager or build it from source. Since currently the
-release cycle is not established yet, we recommend installing hivemind from source to keep up with the latest bugfixes
-and improvements.
+You can install [the latest release](https://pypi.org/project/hivemind) with pip or build hivemind from source.
 
 ### With pip
 
@@ -42,7 +44,7 @@ pip install hivemind
 
 ### From source
 
-To install hivemind from source, simply clone the repository and install
+To install hivemind from source, simply run the following:
 
 ```
 git clone https://github.com/learning-at-home/hivemind.git
@@ -53,11 +55,31 @@ pip install .
 If you would like to verify that your installation is working properly, you can install with `pip install -e .[dev]`
 instead. Then, you can run the tests with `pytest tests/`.
 
+By default, hivemind uses the precompiled binary of
+the [go-libp2p-daemon](https://github.com/learning-at-home/go-libp2p-daemon) library. If you face compatibility issues
+or want to build the binary yourself, you can recompile it by running `pip install . --global-option="--buildgo"`.
+Before running the compilation, please ensure that your machine has a recent version
+of the [Go toolchain](https://golang.org/doc/install) (1.15 or higher).
+
+### System requirements
+- __Linux__ is the default OS for which hivemind is developed and tested. We recommend Ubuntu 18.04+ (64-bit),
+  but other 64-bit distros should work as well. Legacy 32-bit is not recommended.
+- __macOS 10.x__ mostly works but requires building hivemind from source, and some edge cases may fail.
+  To ensure full compatibility, we recommend using [our Docker image](https://hub.docker.com/r/learningathome/hivemind).
+- __Windows 10+ (experimental)__ can run hivemind using [WSL](https://docs.microsoft.com/ru-ru/windows/wsl/install-win10).
+  You can configure WSL to use GPU following [this guide](https://docs.nvidia.com/cuda/wsl-user-guide/index.html) by NVIDIA.
+  After the CUDA toolkit is installed you can simply follow the instructions above to install with pip or from source.
+
 ## Documentation
 
-* [Quickstart](https://learning-at-home.readthedocs.io/en/latest/user/quickstart.html): install hivemind, set up a
-  server and train experts
-* Documentation & guides are available at [learning-at-home.readthedocs.io](https://learning-at-home.readthedocs.io)
+* The [quickstart tutorial](https://learning-at-home.readthedocs.io/en/latest/user/quickstart.html) walks through installation
+  and training a simple neural network with several peers.
+* [examples/albert](https://github.com/learning-at-home/hivemind/tree/master/examples/albert) contains the starter kit
+  and instructions for training a Transformer masked language model collaboratively.
+* API reference and additional tutorials are available at [learning-at-home.readthedocs.io](https://learning-at-home.readthedocs.io)
+
+If you have any questions about installing and using hivemind, you can ask them in 
+[our Discord chat](https://discord.gg/xC7ucM8j) or file an [issue](https://github.com/learning-at-home/hivemind/issues).
 
 ## Contributing
 
@@ -66,7 +88,7 @@ documentation improvements to entirely new features, is equally appreciated.
 
 If you want to contribute to hivemind but don't know where to start, take a look at the
 unresolved [issues](https://github.com/learning-at-home/hivemind/issues). Open a new issue or
-join [our chat room](https://gitter.im/learning-at-home/hivemind) in case you want to discuss new functionality or
+join [our chat room](https://discord.gg/xC7ucM8j) in case you want to discuss new functionality or
 report a possible bug. Bug fixes are always welcome, but new features should be preferably discussed with maintainers
 beforehand.
 
@@ -77,7 +99,7 @@ our [guide](https://learning-at-home.readthedocs.io/en/latest/user/contributing.
 
 ## Citation
 
-If you found hivemind or its underlying algorithms useful for your experiments, please cite the following source:
+If you found hivemind or its underlying algorithms useful for your research, please cite the relevant papers:
 
 ```
 @misc{hivemind,
@@ -88,7 +110,8 @@ If you found hivemind or its underlying algorithms useful for your experiments,
 }
 ```
 
-Also, you can cite [the paper](https://arxiv.org/abs/2002.04013) that inspired the creation of this library:
+Also, you can cite [the paper](https://arxiv.org/abs/2002.04013) that inspired the creation of this library
+(prototype implementation of hivemind available at [mryab/learning-at-home](https://github.com/mryab/learning-at-home)):
 
 ```
 @inproceedings{ryabinin2020crowdsourced,
@@ -104,10 +127,49 @@ Also, you can cite [the paper](https://arxiv.org/abs/2002.04013) that inspired t
 }
 ```
 
-The initial implementation of hivemind used for the paper is available
-at [mryab/learning-at-home](https://github.com/mryab/learning-at-home).
+<details>
+ <summary>Additional publications</summary>
+
+["Moshpit SGD: Communication-Efficient Decentralized Training on Heterogeneous Unreliable Devices"](https://arxiv.org/abs/2103.03239)
+
+```
+@misc{ryabinin2021moshpit,
+      title={Moshpit SGD: Communication-Efficient Decentralized Training on Heterogeneous Unreliable Devices}, 
+      author={Max Ryabinin and Eduard Gorbunov and Vsevolod Plokhotnyuk and Gennady Pekhimenko},
+      year={2021},
+      eprint={2103.03239},
+      archivePrefix={arXiv},
+      primaryClass={cs.LG}
+}
+```
+
+["Distributed Deep Learning in Open Collaborations"](https://arxiv.org/abs/2106.10207)
+
+```
+@misc{diskin2021distributed,
+      title={Distributed Deep Learning in Open Collaborations}, 
+      author={Michael Diskin and Alexey Bukhtiyarov and Max Ryabinin and Lucile Saulnier and Quentin Lhoest and Anton Sinitsin and Dmitry Popov and Dmitry Pyrkin and Maxim Kashirin and Alexander Borzunov and Albert Villanova del Moral and Denis Mazur and Ilia Kobelev and Yacine Jernite and Thomas Wolf and Gennady Pekhimenko},
+      year={2021},
+      eprint={2106.10207},
+      archivePrefix={arXiv},
+      primaryClass={cs.LG}
+}
+```
+
+["Secure Distributed Training at Scale"](https://arxiv.org/abs/2106.11257)
+
+```
+@misc{gorbunov2021secure,
+      title={Secure Distributed Training at Scale}, 
+      author={Eduard Gorbunov and Alexander Borzunov and Michael Diskin and Max Ryabinin},
+      year={2021},
+      eprint={2106.11257},
+      archivePrefix={arXiv},
+      primaryClass={cs.LG}
+}
+```
 
-In the documentation, we list
-several [related](https://learning-at-home.readthedocs.io/en/latest/user/acknowledgements.html) projects and
-acknowledgements.
+</details>
 
+We also maintain a list of [related projects and
+acknowledgements](https://learning-at-home.readthedocs.io/en/latest/user/acknowledgements.html).
diff --git a/docs/index.rst b/docs/index.rst
@@ -21,6 +21,8 @@ documentation below.
 
   user/quickstart
   modules/index
+  user/dht
+  user/moe
   user/contributing
   user/benchmarks
   user/acknowledgements

diff --git a/docs/user/acknowledgements.md b/docs/user/acknowledgements.md
@@ -1,6 +1,6 @@
-# Credits
+# Acknowledgements
 
-We kindly thank (in random order)
+We kindly thank (in no particular order)
 
 * [Artem Babenko](https://research.yandex.com/people/102794) and
   [Vladimir Aliev](https://ru.linkedin.com/in/vladimir-aliev-19b93282) for helpful discussions and editorial review of
@@ -14,15 +14,15 @@ We kindly thank (in random order)
 * [Brian Muller](https://github.com/bmuller/kademlia) for his implementations
   of [kademlia](https://github.com/bmuller/kademlia) and [rpcudp](https://github.com/bmuller/rpcudp)
 * Alexander Sherbakov for helpful discussions on PC and server component architecture,
-* Our early adopters, [contributors](https://github.com/learning-at-home/hivemind/graphs/contributors), and reviewers
+* [Yandex School of Data Analysis](https://yandexdataschool.com) students, for helping us run the first truly collaborative experiments.
+* The [Neuropark community](https://neuropark.co/), for hosting early collaborative training experiments of sahajBERT with hivemind.
+* Our early adopters, [contributors](https://github.com/learning-at-home/hivemind/graphs/contributors), and conference reviewers.
 
 # Related projects
 
-We also want to reference several projects that have similar ideas in mind:
-
+In this section, we list several organizations and research projects that bring humanity closer to the dream of world-scale deep learning with volunteer computing.
+* [Hugging Face](https://huggingface.co) — an AI community with world-leading NLP research that builds collaborative hub training using hivemind. 
+* [EYDLE](https://www.eydle.com) — a start-up that works towards distributed deep learning on volunteer hardware using centralized infrastructure.
 * [BitTensor](https://github.com/opentensor/BitTensor) — a decentralized deep learning ecosystem with incentive
-  mechanism. Like hivemind, but peers are getting rewarded for their contribution to other peers. .
-* [GShard](https://arxiv.org/abs/2006.16668) — a paper by Dmitry Lepikhin et al. that demonstrate the effectiveness of
-  huge Mixture-of-Experts models on conventional hpc hardware. Those guys train models 4 times the size of GPT-3 on
-  thousands of TPUv3.
-* Also doing research in decentralized deep learning? Let us know!
+  mechanism. Each peer trains for its own objective and rewards others for useful features. 
+* Also building collaborative deep learning? Let us know! `hivemind-team <at> hotmail.com`
diff --git a/docs/user/dht.md b/docs/user/dht.md
@@ -0,0 +1,135 @@
+# Hivemind DHT
+
+In order to coordinate, hivemind peers form a Distributed Hash Table: a distributed "dictionary" where each peer
+can store and get values. To initialize the first DHT node, run:
+
+```python
+from hivemind import DHT, get_dht_time
+
+dht = DHT(start=True)
+# create the first DHT node that listens for incoming connections from localhost only
+
+print("For incoming connections, use:", dht.get_visible_maddrs())
+```
+
+You can now start more peers that connect to an existing DHT node using its listen address:
+```python
+dht2 = DHT(initial_peers=dht.get_visible_maddrs(), start=True)
+```
+
+Note that `initial_peers` contains the address of the first DHT node.
+This implies that the resulting node will share its key-value storage with the first node, __as well as any other
+nodes connected to it.__ When the two nodes are connected, subsequent peers can use any one of them (or both)
+as `initial_peers` to connect to the shared "dictionary".
+
+### Store/get operations
+
+Once the DHT is formed, all participants can `dht.store` key-value pairs in the DHT and `dht.get` them by key:
+
+```python
+# first node: store a key-value pair for 600 seconds
+store_ok = dht.store('my_key', ('i', 'love', 'bees'),
+                     expiration_time=get_dht_time() + 600)
+
+# second node: get the value stored by the first node
+value, expiration = dht2.get('my_key', latest=True)
+assert value == ('i', 'love', 'bees')
+```
+
+As you can see, each value in a hivemind DHT is associated with an expiration time,
+computed as the current `get_dht_time()` plus some offset.
+This expiration time is used to clean up old data and resolve write conflicts:
+DHT nodes always prefer values with higher expiration time and may delete any value past its expiration.
+
+### Values with subkeys
+
+Hivemind DHT also supports a special value type that is itself a dictionary. When nodes store such a value,
+they add sub-keys to the dictionary instead of overwriting it.
+
+Consider an example where three DHT nodes want to find out who is going to attend the party:
+
+```python
+alice_dht = DHT(initial_peers=dht.get_visible_maddrs(), start=True)
+bob_dht = DHT(initial_peers=dht2.get_visible_maddrs(), start=True)
+carol_dht = DHT(initial_peers=alice_dht.get_visible_maddrs(), start=True)
+
+
+# first, each peer stores a subkey for the same key
+alice_dht.store('party', subkey='alice', value='yes', expiration_time=get_dht_time() + 600)
+bob_dht.store('party', subkey='bob', value='yes', expiration_time=get_dht_time() + 600)
+carol_dht.store('party', subkey='carol', value='no', expiration_time=get_dht_time() + 600)
+
+# then, any peer can get the full list of attendees
+attendees, expiration = alice_dht.get('party', latest=True)
+print(attendees)
+# {'alice': ValueWithExpiration(value='yes', expiration_time=1625504352.2668974),
+#  'bob': ValueWithExpiration(value='yes', expiration_time=1625504352.2884178),
+#  'carol': ValueWithExpiration(value='no', expiration_time=1625504352.3046832)}
+```
+
+When training over the Internet, some `dht.get/store` requests may take hundreds of milliseconds or even seconds.
+To minimize the wait time, you can call these requests asynchronously via
+[`dht.store/get/run_coroutine(..., return_future=True)`](https://learning-at-home.readthedocs.io/en/latest/modules/dht.html#hivemind.dht.DHT.get).
+This will run the corresponding command in the background and return a [Future-like](https://docs.python.org/3/library/concurrent.futures.html) object that can be awaited.
+Please also note that the returned future is compatible with asyncio (i.e. can be awaited inside the event loop).
+
+For more details on DHT store/get and expiration time, please refer to the [documentation for DHT and DHTNode](https://learning-at-home.readthedocs.io/en/latest/modules/dht.html#dht-and-dhtnode).
+
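Annotation (not part of the commit): the expiration and subkey rules described above can be illustrated with a small in-memory model. This is only a sketch under stated assumptions, not hivemind's implementation: `ToyStore` and `_SubkeyDict` are hypothetical names, there is no networking, time is passed in explicitly through a made-up `now` argument, and rejecting a write whose expiration time merely ties the stored one is an assumption of this sketch.

```python
from typing import Any, Dict, Hashable, NamedTuple, Optional


class ValueWithExpiration(NamedTuple):
    value: Any
    expiration_time: float


class _SubkeyDict(dict):
    """Marker type (hypothetical): a value that was built from subkey writes."""


class ToyStore:
    """Hypothetical single-process stand-in for one node's local storage."""

    def __init__(self) -> None:
        self._data: Dict[Hashable, ValueWithExpiration] = {}

    def get(self, key: Hashable, now: float) -> Optional[ValueWithExpiration]:
        found = self._data.get(key)
        if found is None:
            return None
        if isinstance(found.value, _SubkeyDict):
            # Drop expired subkeys; the key survives while any subkey is alive.
            alive = _SubkeyDict(
                (k, v) for k, v in found.value.items() if v.expiration_time > now
            )
            if not alive:
                del self._data[key]
                return None
            latest = max(v.expiration_time for v in alive.values())
            self._data[key] = ValueWithExpiration(alive, latest)
            return self._data[key]
        if found.expiration_time <= now:
            del self._data[key]  # values past their expiration may be deleted
            return None
        return found

    def store(self, key: Hashable, value: Any, expiration_time: float,
              now: float, subkey: Optional[Hashable] = None) -> bool:
        if expiration_time <= now:
            return False  # refuse to store an already expired value
        current = self.get(key, now)
        if subkey is None:
            # Plain value: the write with the higher expiration time wins.
            if current is not None and current.expiration_time >= expiration_time:
                return False
            self._data[key] = ValueWithExpiration(value, expiration_time)
            return True
        # Subkey write: add to the dictionary instead of overwriting it.
        # Simplification: a plain value already stored under this key is replaced.
        entries = _SubkeyDict()
        if current is not None and isinstance(current.value, _SubkeyDict):
            entries.update(current.value)
        previous = entries.get(subkey)
        if previous is not None and previous.expiration_time >= expiration_time:
            return False
        entries[subkey] = ValueWithExpiration(value, expiration_time)
        latest = max(v.expiration_time for v in entries.values())
        self._data[key] = ValueWithExpiration(entries, latest)
        return True


store = ToyStore()
first_write = store.store("my_key", "old", expiration_time=100.0, now=0.0)
stale_write = store.store("my_key", "stale", expiration_time=50.0, now=0.0)
newer_write = store.store("my_key", "new", expiration_time=200.0, now=0.0)

store.store("party", "yes", expiration_time=600.0, now=0.0, subkey="alice")
store.store("party", "yes", expiration_time=600.0, now=0.0, subkey="bob")
store.store("party", "no", expiration_time=300.0, now=0.0, subkey="carol")
attendees_at_400 = store.get("party", now=400.0)
```

In this model the write with the lower expiration time is rejected, the later write replaces the earlier one, and each subkey expires independently of the others. A real DHT additionally has to replicate values across peers, which this sketch does not attempt.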
+### Running across the Internet
+
+By default, DHT nodes are only accessible from your localhost. In order to run with multiple geographically
+distributed computers, you must connect the DHT to a global network. Currently, there are two ways to achieve this.
+
+The recommended approach is to grow the network from one or several initial peers. These can be any computers with a
+public IP address that are always online. Each of these peers should simply create `hivemind.DHT` and set it to
+accept incoming connections from the internet:
+
+```python
+import hivemind
+dht = hivemind.DHT(
+    host_maddrs=["/ip4/0.0.0.0/tcp/0", "/ip4/0.0.0.0/udp/0/quic"],
+    start=True)
+
+print('\n'.join(str(addr) for addr in dht.get_visible_maddrs()))
+print("Global IP:", hivemind.utils.networking.choose_ip_address(dht.get_visible_maddrs()))
+```
+
+Running this code will print several strings (typically 4 or 6) of the following form (example):
+```shell
+/ip4/185.185.123.124/tcp/40615/p2p/QmaVTB2LwayToK2rzMkaCbkCaH7nF2rTHIS0IS0AN0EXAMPLE
+/ip4/127.0.0.1/tcp/40615/p2p/QmaVTB2LwayToK2rzMkaCbkCaH7nF2rTHIS0IS0AN0EXAMPLE
+/ip4/185.185.123.124/udp/40346/quic/p2p/QmaVTB2LwayToK2rzMkaCbkCaH7nF2rTHIS0IS0AN0EXAMPLE
+/ip4/127.0.0.1/udp/40346/quic/p2p/QmaVTB2LwayToK2rzMkaCbkCaH7nF2rTHIS0IS0AN0EXAMPLE
+Global IP: 185.185.123.124
+```
+These lines contain addresses that other nodes can use to connect to the network:
+- `127.0.0.1` or `192.168.X.Y` are only accessible from your computer or local network, respectively.
+- The remaining address is __global__ (`185.185.123.124` in the example, yours will be different).
+
+To connect a new peer to the network, you should specify `initial_peers` as the addresses that 
+correspond to the public IP:
+
+```python
+import hivemind
+dht = hivemind.DHT(
+    host_maddrs=["/ip4/0.0.0.0/tcp/0", "/ip4/0.0.0.0/udp/0/quic"],
+    initial_peers=[
+        "/ip4/185.185.123.124/tcp/40615/p2p/QmaVTB2LwayToK2rzMkaCbkCaH7nF2rTHIS0IS0AN0EXAMPLE",
+        "/ip4/185.185.123.124/udp/40346/quic/p2p/QmaVTB2LwayToK2rzMkaCbkCaH7nF2rTHIS0IS0AN0EXAMPLE",
+    ], start=True)
+```
+
+That's it: the two DHT nodes are now connected. If you connect additional peers to the network, you only need to specify
+one (or a subset) of peers as `initial_peers`.
+In case your peer operates behind a restrictive firewall, you may find it beneficial to set `client_mode=True`. In this
+case, the DHT instance will access others, but it will not announce that other peers can connect to it.
+
+Another (experimental) way is to use [IPFS](https://ipfs.io/): a global decentralized network for file storage.
+We are not storing any files here: instead, we can use IPFS nodes to help hivemind peers find each other.
+To use this strategy, set `use_ipfs=True` in each DHT node you create. This allows you to connect multiple DHT nodes even if
+all of them are behind NAT. However, this strategy may be unreliable and depend heavily on the availability of public
+IPFS nodes.
+
+To learn more about the network address format, read [libp2p addressing](https://docs.libp2p.io/concepts/addressing/).
+For an example of how to set up DHT in a distributed training experiment, see
+[examples/albert](https://github.com/learning-at-home/hivemind/tree/master/examples/albert).
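Annotation (not part of the commit): the "Running across the Internet" section asks the reader to pick out the addresses that correspond to the public IP by eye. The selection could be sketched with the standard `ipaddress` module as follows. `public_maddrs` is a hypothetical helper, not a hivemind function, and it assumes that every address starts with `/ip4/<address>` or `/ip6/<address>`, as in the example output above. Anything else, such as DNS-based addresses, is skipped.

```python
import ipaddress
from typing import Iterable, List


def public_maddrs(maddrs: Iterable[str]) -> List[str]:
    """Keep only the addresses whose leading IP component is publicly routable."""
    selected = []
    for maddr in maddrs:
        parts = str(maddr).strip("/").split("/")
        # Assumption: the first component is /ip4/<address> or /ip6/<address>.
        if len(parts) < 2 or parts[0] not in ("ip4", "ip6"):
            continue
        address = ipaddress.ip_address(parts[1])
        # Loopback (127.0.0.1) and private (192.168.X.Y) addresses are only
        # reachable from the same computer or local network, so skip them.
        if address.is_loopback or address.is_private or address.is_unspecified:
            continue
        selected.append(str(maddr))
    return selected


# The example output from the section above, copied verbatim.
example_maddrs = [
    "/ip4/185.185.123.124/tcp/40615/p2p/QmaVTB2LwayToK2rzMkaCbkCaH7nF2rTHIS0IS0AN0EXAMPLE",
    "/ip4/127.0.0.1/tcp/40615/p2p/QmaVTB2LwayToK2rzMkaCbkCaH7nF2rTHIS0IS0AN0EXAMPLE",
    "/ip4/185.185.123.124/udp/40346/quic/p2p/QmaVTB2LwayToK2rzMkaCbkCaH7nF2rTHIS0IS0AN0EXAMPLE",
    "/ip4/127.0.0.1/udp/40346/quic/p2p/QmaVTB2LwayToK2rzMkaCbkCaH7nF2rTHIS0IS0AN0EXAMPLE",
]
initial_peers = public_maddrs(example_maddrs)
```

The resulting list could then be passed as `initial_peers` when creating a new peer. A host behind NAT may have no publicly routable address at all, in which case the list is empty and one of the other approaches from that section applies.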