
Fix RLHF slowdown in attention multi steps extend_step. #849

Merged
merged 1 commit into apple:main on Nov 19, 2024

Conversation

ds-hwang
Contributor

Fix RLHF slowdown in attention multi steps extend_step.

jax.lax.dynamic_update_slice_in_dim is generally faster than advanced indexing,
but an unusual slowdown was observed, with RLHF sampling taking up to 3 hours
per run. TODO: Investigate and fix it.

For your information, in #831, I experimented
with both dynamic_update_slice and advanced indexing on TPUv4 and chose the
faster option. It's also known that dynamic_update_slice performs better when
copying contiguous memory, which makes this case all the more surprising.

Advanced Indexing

----------------------------------------------------------------------------------------
Benchmark                                              Time             CPU   Iterations
----------------------------------------------------------------------------------------
QkvLinearExtendStepBenchmark/2048/16/1024/1         7.16 ms        0.623 ms          492
QkvLinearExtendStepBenchmark/2048/16/4096/1         8.52 ms        0.624 ms          561
QkvLinearExtendStepBenchmark/2048/16/32768/1        34.6 ms         1.64 ms           78
QkvLinearExtendStepBenchmark/2048/16/4096/8         63.6 ms         1.74 ms           81
QkvLinearExtendStepBenchmark/2048/16/4096/64         276 ms         2.40 ms           81
QkvLinearExtendStepBenchmark/2048/16/4096/512       2541 ms         81.6 ms            1

dynamic_update_slice

----------------------------------------------------------------------------------------
Benchmark                                              Time             CPU   Iterations
----------------------------------------------------------------------------------------
QkvLinearExtendStepBenchmark/2048/16/1024/1         1.70 ms        0.513 ms         1125
QkvLinearExtendStepBenchmark/2048/16/4096/1         3.40 ms        0.519 ms         1174
QkvLinearExtendStepBenchmark/2048/16/32768/1        20.1 ms        0.930 ms          404
QkvLinearExtendStepBenchmark/2048/16/4096/8         3.68 ms        0.524 ms         1139
QkvLinearExtendStepBenchmark/2048/16/4096/64        3.74 ms        0.532 ms         1125
QkvLinearExtendStepBenchmark/2048/16/4096/512       2530 ms         80.4 ms            1
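The two cache-update strategies being benchmarked can be sketched minimally in JAX. This is a hypothetical illustration, not the axlearn implementation: the shapes and function names are invented, and the real `extend_step` carries additional state. Both variants write `steps` new key projections into the time axis of a KV cache starting at `time_step`.

```python
import jax
import jax.numpy as jnp


def update_cache_advanced_indexing(cached_key, k_proj, time_step):
    # Advanced indexing: scatter into the cache via .at[].set with
    # a dynamically computed index vector along the time axis.
    steps = k_proj.shape[1]
    indices = time_step + jnp.arange(steps)
    return cached_key.at[:, indices].set(k_proj)


def update_cache_dynamic_slice(cached_key, k_proj, time_step):
    # dynamic_update_slice_in_dim: a contiguous copy along the time axis,
    # which is usually the faster lowering on TPU.
    return jax.lax.dynamic_update_slice_in_dim(
        cached_key, k_proj, time_step, axis=1
    )


# Toy shapes: [batch, seq_len, num_heads, head_dim].
batch, seq_len, num_heads, head_dim = 2, 8, 4, 16
cache = jnp.zeros((batch, seq_len, num_heads, head_dim))
new_k = jnp.ones((batch, 2, num_heads, head_dim))  # 2 new decode steps

a = update_cache_advanced_indexing(cache, new_k, 3)
b = update_cache_dynamic_slice(cache, new_k, 3)
assert jnp.array_equal(a, b)  # identical results, different lowerings
```

The two are functionally equivalent; the benchmark differences above come entirely from how XLA lowers the scatter versus the contiguous slice update.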

Contributor

@ruomingp ruomingp left a comment


Does this fix the slowdown?

@ds-hwang
Contributor Author

ds-hwang commented Nov 19, 2024

Does this fix the slowdown?

Yes :) More details in #918.
Thank you for the review!

@ds-hwang ds-hwang added this pull request to the merge queue Nov 19, 2024
Merged via the queue into apple:main with commit 2803b36 Nov 19, 2024
10 checks passed
@ds-hwang ds-hwang deleted the mult_bug_fix branch November 19, 2024 17:58
ds-hwang added a commit to ds-hwang/axlearn that referenced this pull request Dec 2, 2024
`k_proj` does not have its sharding hints properly set, so QKVLinear.extend_step cannot
create the next `cached_key` with proper hints.
This causes an OOM for the diffusion model, because the code cannot know the local
batch size.
     Shape: f32[1024,2048,8,128]{3,2,1,0:T(8,128)}
     Unpadded size: 8.00G

To fix it, copy `cached_key.sharding` to `k_proj.sharding`, as the `cached_key`
sharding is properly set up.

In addition, this was the cause of the RLHF slowdown, so revert the workaround change in
apple#849.
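The fix described in that commit can be sketched as follows. This is a hedged illustration, not the axlearn code: the array shapes are invented, and `jax.device_put` stands in for however the real code propagates the annotation inside the traced computation. The idea is simply to copy the cache's known-good sharding onto the freshly computed projection.

```python
import jax
import jax.numpy as jnp
from jax.sharding import SingleDeviceSharding

# `cached_key` has its sharding properly annotated (single device here
# for a self-contained example; the real cache is sharded across a mesh).
cached_key = jax.device_put(
    jnp.zeros((4, 8)), SingleDeviceSharding(jax.devices()[0])
)

# `k_proj` comes out of the projection with no sharding hint.
k_proj = jnp.ones((4, 2))

# The fix: reuse the cache's sharding for the projection, so downstream
# ops (the cache update) see a consistent, known layout.
k_proj = jax.device_put(k_proj, cached_key.sharding)
assert k_proj.sharding == cached_key.sharding
```

With the sharding known, the compiler no longer has to guess the local batch size when materializing the next `cached_key`, which is what caused the OOM.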
qdavid1 pushed a commit to qdavid1/axlearn that referenced this pull request Dec 11, 2024