Skip to content

Commit

Permalink
doc: readme revision
Browse files Browse the repository at this point in the history
To reflect changes made in v0.1.0, the top-level README.md is revised.
Also, the description for examples is updated to highlight that the
multiworld examples are fault-tolerant.
  • Loading branch information
myungjin committed Jul 26, 2024
1 parent bb0387d commit ededdbb
Show file tree
Hide file tree
Showing 2 changed files with 11 additions and 17 deletions.
10 changes: 1 addition & 9 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -38,15 +38,7 @@ The framework is built on top of PyTorch, a widely-used deep learning framework,

## Dependencies and Version

* [PyTorch](https://pytorch.org/get-started/previous-versions/#v221) version: `2.2.1`

## Prerequisites

* Install [anaconda](www.anaconda.com/download/) or [miniconda](https://docs.conda.io/en/latest/miniconda.html) in order to create the environment.
* Clone repo (you could use `git clone https://github.com/cisco-open/pymultiworld.git`).
* This prerequiste is only for testing a fault tolerance functionality across hosts.
* To test the functionality **in a single machine**, this step can be skipped. Do the remaining installation steps from [here](#installation).
* Too run the test **across hosts**, a custom-built PyTorch is necessary. Follow instructions in this [doc](docs/pytorch_build.md). Details on why to build a custom PyTorch are found in the doc too.
* [PyTorch](https://pypi.org/project/torch/2.4.0/) version: `2.4.0`

## Installation

Expand Down
18 changes: 10 additions & 8 deletions examples/README.md
Original file line number Diff line number Diff line change
@@ -1,16 +1,18 @@
# Examples

The list of available examples can be found here:
This folder contains the examples of various CCL operations. The examples implemented with multiworld's API are fault-tolerant across worlds (i.e., process groups). In other words, when a failure (e.g., network failure, node failure, process crash, etc.) impact a world or one worker in the world, all the workers in the same world is only affected while all other workers in another world won't be affected. **To see the effect of multiworld framework, try to terminate a worker in one world when any multiworld example is tested.**

The list of available examples are as follows:

## Point-to-Point Communication

* [`send_recv - multiple worlds`](send_recv/m8d.py) demonstrates a case where a leader process receives data (e.g., tensors) from workers that belong to different worlds (i.e., process groups).
* [`send_recv - single world`](send_recv/single_world.py) is an example that utilizes the native PyTorch distributed package to send tensors among processes in a single world (i.e., one process group). This example shows that the default process group management can't handle the fault gracefully during model serving.
* [`resnet`](resnet) demonstrates a use case where a ResNet model is run across two workers and failure on one worker won't affect the operation of the other due to the fault domain isolation with the ability of creating multiple worlds (i.e., multiple independent process group).
* [`send_recv: multiple worlds`](send_recv/m8d.py) demonstrates a case where a leader process receivesf data (e.g., tensors) from workers that belong to different worlds (i.e., process groups).
* [`send_recv: single world`](send_recv/single_world.py) is an example that utilizes the native PyTorch distributed package to send tensors among processes in a single world. This example shows that the default process group management can't handle a fault gracefully during tensor tranmission.
* [`resnet`](resnet) demonstrates a use case where a ResNet model is run across two workers and failure on one worker won't affect the operation of the other due to the fault domain isolation with the ability of creating multiple worlds (i.e., multiple independent process groups).

## Collective Communication

* [`all_reduce`](all_reduce) This script demonstrates a case where all_reduce on tensors are executed for different worlds, without any interference across different worlds.
* [`all_gather`](all_gather) This script demonstrates a case where all_gather on tensors are executed for different worlds, without any interference across different worlds.
* [`broadcast`](broadcast) This script demonstrates a case where broadcast is executed for different worlds, without any interference across different worlds, with a different src on each step.
* [`reduce`](reduce) This script demonstrates a case where reduce is executed for different worlds, on a different destination rank, without any interference across different worlds
* [`broadcast`](broadcast) Broadcast is a CCL operation where one worker (rank) as source broadcasts its data to the rest of workers in the same world. This script demonstrates a case where broadcast is executed with different worlds.
* [`reduce`](reduce) Reduce is a CCL operation where the values from all workers (ranks) in the same world is aggregated and the final aggregated result is sent to a destination worker. This script demonstrates a case where reduce is executed with different worlds.
* [`all_reduce`](all_reduce) All-reduce is a CCL operation where all workers (ranks) participate in a reduce operation and receive the same final result at the end of the operation. This script demonstrates a case where all_reduce on tensors is executed with different worlds.
* [`all_gather`](all_gather) All-gather is a CCL operation where all workers (ranks) gather values owned by other workers in a distributed manner. This script demonstrates a case where all_gather on tensors is executed with different worlds.

0 comments on commit ededdbb

Please sign in to comment.