You can analyze the fine-grained performance of BytePS with the profiling tool.
Use the following environment variables to enable profiling of the communication operations:
"BYTEPS_TRACE_ON" = "1"
"BYTEPS_TRACE_END_STEP" = "20"
"BYTEPS_TRACE_START_STEP"="10"
"BYTEPS_TRACE_DIR"= "./traces"
First, `BYTEPS_TRACE_ON` should be set to `1` to enable profiling communication traces. `BYTEPS_TRACE_START_STEP` and `BYTEPS_TRACE_END_STEP` decide the step interval to profile: traces from step `BYTEPS_TRACE_START_STEP` to step `BYTEPS_TRACE_END_STEP` will be collected automatically, and the resulting traces are output in the Chrome trace format. `BYTEPS_TRACE_DIR` denotes the path where you want to store the traces.
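These variables can also be set inside the training script, as long as that happens before BytePS initializes (the core library reads them from the environment at init time). A minimal Python sketch; the MXNet import is illustrative, substitute your framework's BytePS module:

```python
import os

# Set the tracing variables before BytePS initializes, since the core
# library reads them from the environment at init time.
os.environ["BYTEPS_TRACE_ON"] = "1"
os.environ["BYTEPS_TRACE_START_STEP"] = "10"
os.environ["BYTEPS_TRACE_END_STEP"] = "20"
os.environ["BYTEPS_TRACE_DIR"] = "./traces"

import byteps.mxnet as bps  # or byteps.torch / byteps.tensorflow

bps.init()
```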
The resulting trace directory is organized as follows:
```
traces/
├── 0
│   └── comm.json
└── 1
    └── comm.json
```
Here, `traces/` is the trace directory we defined using `BYTEPS_TRACE_DIR`. It contains several sub-directories, each of which corresponds to one GPU and is named with the local rank of that GPU; e.g., `./traces/0/` stores the trace results of the GPU whose local rank is `0`. Each sub-directory contains the following files:

- `comm.json`: the final trace file, which contains the communication traces of all gradients.
Let's look deeper into the traces.
```json
{
    "ph": "X",
    "args": {
        "name": "Comm.byteps.gradient_0"
    },
    "pid": "Comm.byteps.gradient_0",
    "name": "Comm.byteps.gradient_0",
    "ts": 1574685989504865,
    "dur": 24026,
    "tid": "total"
},
{
    "ph": "X",
    "args": {
        "name": "Comm.byteps.gradient_0"
    },
    "pid": "Comm.byteps.gradient_0",
    "name": "Comm.byteps.gradient_0.BROADCAST",
    "ts": 1574685984662375,
    "dur": 1074,
    "tid": "26148864"
}
```
Basically, the trace event format is the same as the standard Trace Event Format. Here, `name` is the name of one event, which can be shown on `chrome://tracing/`. Note that BytePS divides each gradient into multiple partitions if necessary, and each partition goes through several of the following operations, namely `QueueType`s:
"COORDINATE_REDUCE",
"REDUCE",
"COPYD2H",
"PCIE_REDUCE",
"COORDINATE_PUSH",
"PUSH",
"PULL",
"COPYH2D",
"COORDINATE_BROADCAST",
"BROADCAST"
So there are two types of events:
- If `tid` is `total`, the event records the entire interval to synchronize one gradient, including the queue time. In this case, `name` ends with the gradient index.
- If `tid` is a number, the event records the interval for each `QueueType` of each partition of one gradient. In this case, `name` ends with the gradient index and the corresponding `QueueType`, and `tid` denotes the partition id (see the parsing sketch below).
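With these two event types, you can aggregate the trace programmatically, e.g., to see how much time each `QueueType` accumulates. Below is a minimal sketch, assuming `comm.json` stores Chrome-trace events either as a plain list or under a `traceEvents` key (the path `traces/0/comm.json` is just an example):

```python
import json
from collections import defaultdict

# Load one GPU's trace file; handle both a plain event list and an
# object wrapping the events under "traceEvents".
with open("traces/0/comm.json") as f:
    trace = json.load(f)
events = trace["traceEvents"] if isinstance(trace, dict) else trace

total_us = defaultdict(int)  # gradient name -> end-to-end sync time (us)
queue_us = defaultdict(int)  # QueueType -> accumulated partition time (us)

for ev in events:
    if ev.get("ph") != "X":  # only complete ("X") events carry durations
        continue
    if ev.get("tid") == "total":
        total_us[ev["name"]] += ev["dur"]
    else:
        # Per-partition event names end with the QueueType,
        # e.g. "Comm.byteps.gradient_0.BROADCAST".
        queue_us[ev["name"].rsplit(".", 1)[-1]] += ev["dur"]

for queue, us in sorted(queue_us.items(), key=lambda kv: -kv[1]):
    print(f"{queue:>22}: {us / 1e3:.2f} ms")
```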
Note that in BytePS, when there are multiple GPUs on one worker, only the root GPU is responsible for synchronizing with the servers, and the GPUs on one worker update parameters through a local all-reduce. Therefore, you can observe `PUSH` and `PULL` operations only in the traces of the root GPU. By default, the root GPU is the one with the largest local rank.
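If you are not sure which trace belongs to the root GPU, you can check which `comm.json` actually contains `PUSH`/`PULL` events. A hypothetical sketch under the same file-format assumptions as above:

```python
import glob
import json

# The root GPU's trace is the only one containing PUSH/PULL events.
for path in sorted(glob.glob("traces/*/comm.json")):
    with open(path) as f:
        trace = json.load(f)
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    is_root = any(ev.get("name", "").endswith((".PUSH", ".PULL"))
                  for ev in events)
    print(path, "<- root GPU" if is_root else "")
```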
Below is a visualization example of `comm.json` in `chrome://tracing/`.
Below is the latency when running the `bert_12_768_12` model with 2 workers, each containing two V100 GPUs with 16GB of memory. BytePS Timeline collects traces from step 10 to step 20; after step 20, it asynchronously outputs the trace results, which may also cause extra overhead. Ignoring the warm-up phase (the first 10 steps), the overhead induced by BytePS Timeline is small.