doc: readme revision #57

Merged 1 commit on Jul 26, 2024.
10 changes: 1 addition & 9 deletions README.md
@@ -38,15 +38,7 @@ The framework is built on top of PyTorch, a widely-used deep learning framework,

## Dependencies and Version

- * [PyTorch](https://pytorch.org/get-started/previous-versions/#v221) version: `2.2.1`
-
- ## Prerequisites
-
- * Install [anaconda](www.anaconda.com/download/) or [miniconda](https://docs.conda.io/en/latest/miniconda.html) in order to create the environment.
- * Clone repo (you could use `git clone https://github.com/cisco-open/pymultiworld.git`).
- * This prerequiste is only for testing a fault tolerance functionality across hosts.
- * To test the functionality **in a single machine**, this step can be skipped. Do the remaining installation steps from [here](#installation).
- * Too run the test **across hosts**, a custom-built PyTorch is necessary. Follow instructions in this [doc](docs/pytorch_build.md). Details on why to build a custom PyTorch are found in the doc too.
+ * [PyTorch](https://pypi.org/project/torch/2.4.0/) version: `2.4.0`
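To confirm the environment matches the `2.4.0` requirement above, a minimal sketch in plain Python (assuming only that `torch` is importable; not part of the repository):

```python
import torch

# The updated README pins PyTorch 2.4.0; compare against the locally installed build.
REQUIRED = "2.4.0"
installed = torch.__version__.split("+")[0]  # drop local suffixes such as "+cu121"

if installed == REQUIRED:
    print(f"torch {installed} matches the required version")
else:
    print(f"warning: found torch {installed}, but the README expects {REQUIRED}")
```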

## Installation

18 changes: 10 additions & 8 deletions examples/README.md
@@ -1,16 +1,18 @@
# Examples

- The list of available examples can be found here:
+ This folder contains examples of various CCL operations. The examples implemented with multiworld's API are fault-tolerant across worlds (i.e., process groups). In other words, when a failure (e.g., a network failure, node failure, or process crash) impacts a world or one worker in that world, only the workers in that same world are affected; workers in other worlds are not. **To see the effect of the multiworld framework, try terminating a worker in one world while running any multiworld example.**
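To make the "worlds (process groups)" terminology concrete, here is a rough sketch using only the native `torch.distributed` API. It merely shows workers partitioned into two groups; plain PyTorch groups do not provide the fault isolation described above, so treat it as illustration rather than multiworld's API (the file name and launch command are hypothetical):

```python
import torch
import torch.distributed as dist

def main() -> None:
    # Launch sketch: torchrun --nproc_per_node=4 worlds_sketch.py (hypothetical file name).
    dist.init_process_group(backend="gloo")
    rank = dist.get_rank()

    # Partition four workers into two "worlds" (process groups).
    # Every rank must call new_group() for every group, per the PyTorch docs.
    world_a = dist.new_group(ranks=[0, 1])
    world_b = dist.new_group(ranks=[2, 3])
    in_a = rank in (0, 1)
    my_world, my_src = (world_a, 0) if in_a else (world_b, 2)

    # A broadcast in one group involves only that group's members; with native
    # PyTorch, however, a crashed member still breaks its own group's collectives.
    t = torch.tensor([float(my_src)]) if rank == my_src else torch.zeros(1)
    dist.broadcast(t, src=my_src, group=my_world)
    print(f"rank {rank}: received {t.item()} from rank {my_src}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```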
+
+ The list of available examples is as follows:

## Point-to-Point Communication

- * [`send_recv - multiple worlds`](send_recv/m8d.py) demonstrates a case where a leader process receives data (e.g., tensors) from workers that belong to different worlds (i.e., process groups).
- * [`send_recv - single world`](send_recv/single_world.py) is an example that utilizes the native PyTorch distributed package to send tensors among processes in a single world (i.e., one process group). This example shows that the default process group management can't handle the fault gracefully during model serving.
- * [`resnet`](resnet) demonstrates a use case where a ResNet model is run across two workers and failure on one worker won't affect the operation of the other due to the fault domain isolation with the ability of creating multiple worlds (i.e., multiple independent process group).
+ * [`send_recv: multiple worlds`](send_recv/m8d.py) demonstrates a case where a leader process receives data (e.g., tensors) from workers that belong to different worlds (i.e., process groups).
+ * [`send_recv: single world`](send_recv/single_world.py) is an example that utilizes the native PyTorch distributed package to send tensors among processes in a single world. This example shows that the default process group management can't handle a fault gracefully during tensor transmission (see the sketch after this list).
+ * [`resnet`](resnet) demonstrates a use case where a ResNet model is run across two workers; a failure on one worker does not affect the operation of the other, thanks to the fault-domain isolation that comes from creating multiple worlds (i.e., multiple independent process groups).
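For orientation, here is a minimal single-world send/recv sketch built on the native PyTorch distributed package, in the spirit of the `send_recv: single world` bullet above; it is not the example's actual code, and the file name and launch command are assumptions:

```python
import torch
import torch.distributed as dist

def main() -> None:
    # Launch sketch: torchrun --nproc_per_node=2 p2p_sketch.py (hypothetical file name).
    dist.init_process_group(backend="gloo")
    rank = dist.get_rank()

    if rank == 0:
        payload = torch.arange(4, dtype=torch.float32)
        dist.send(tensor=payload, dst=1)   # blocking point-to-point send
        print("rank 0 sent", payload.tolist())
    else:
        buf = torch.empty(4, dtype=torch.float32)
        dist.recv(tensor=buf, src=0)       # blocking point-to-point receive
        print("rank 1 received", buf.tolist())

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```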

## Collective Communication

- * [`all_reduce`](all_reduce) This script demonstrates a case where all_reduce on tensors are executed for different worlds, without any interference across different worlds.
- * [`all_gather`](all_gather) This script demonstrates a case where all_gather on tensors are executed for different worlds, without any interference across different worlds.
- * [`broadcast`](broadcast) This script demonstrates a case where broadcast is executed for different worlds, without any interference across different worlds, with a different src on each step.
- * [`reduce`](reduce) This script demonstrates a case where reduce is executed for different worlds, on a different destination rank, without any interference across different worlds
+ * [`broadcast`](broadcast) Broadcast is a CCL operation where one worker (rank), acting as the source, broadcasts its data to the rest of the workers in the same world. This script demonstrates a case where broadcast is executed with different worlds.
+ * [`reduce`](reduce) Reduce is a CCL operation where the values from all workers (ranks) in the same world are aggregated and the final result is sent to a destination worker. This script demonstrates a case where reduce is executed with different worlds.
+ * [`all_reduce`](all_reduce) All-reduce is a CCL operation where all workers (ranks) participate in a reduce operation and receive the same final result at the end of the operation. This script demonstrates a case where all_reduce on tensors is executed with different worlds (see the sketch after this list).
+ * [`all_gather`](all_gather) All-gather is a CCL operation where all workers (ranks) gather values owned by other workers in a distributed manner. This script demonstrates a case where all_gather on tensors is executed with different worlds.
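As a reference point for the collective bullets above, here is a minimal single-world all_reduce sketch using the native `torch.distributed` API; the multiworld examples run the same kind of collective but inside separate, fault-isolated worlds (the gloo backend and `torchrun` launch are assumptions):

```python
import torch
import torch.distributed as dist

def main() -> None:
    # Launch sketch: torchrun --nproc_per_node=4 all_reduce_sketch.py (hypothetical file name).
    dist.init_process_group(backend="gloo")
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # Each rank contributes its own value; after all_reduce every rank holds the sum.
    t = torch.tensor([float(rank)])
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    print(f"rank {rank}/{world_size}: sum of ranks = {t.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```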