
Generation server using HF accelerate and DS inference #321

Closed
wants to merge 1,165 commits into from

Conversation

mayank31398
Collaborator

This PR depends on
There are some redundant methods in some scripts that can be removed once #308 is merged into the main branch.
This PR adds scripts for creating a generation server using both HF accelerate and DeepSpeed inference.
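For context, a minimal sketch of the two loading paths such a server would wrap. This is not the PR's actual code; the model name, dtype, and GPU count are assumptions for illustration.

```python
# Hedged sketch of the two back ends this PR targets, not the PR's own scripts.
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "bigscience/bloom"  # assumed checkpoint


def load_with_hf_accelerate():
    # accelerate path: device_map="auto" shards the checkpoint across visible GPUs
    return AutoModelForCausalLM.from_pretrained(
        MODEL_NAME, device_map="auto", torch_dtype=torch.bfloat16
    )


def load_with_ds_inference():
    # DeepSpeed inference path: wrap the HF model with fused kernels and
    # tensor parallelism; run under the `deepspeed` launcher with one process per GPU
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16)
    engine = deepspeed.init_inference(
        model,
        mp_size=8,                      # assumed: 8 GPUs / processes
        dtype=torch.float16,
        replace_with_kernel_inject=True,
    )
    return engine.module


def generate_text(model, tokenizer, prompt, max_new_tokens=50):
    inputs = tokenizer(prompt, return_tensors="pt").to(torch.cuda.current_device())
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0], skip_special_tokens=True)


# usage sketch:
# tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# model = load_with_hf_accelerate()   # or load_with_ds_inference()
# print(generate_text(model, tokenizer, "Hello, my name is"))
```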

stas00 and others added 30 commits July 28, 2021 10:02
* Propose a faster preprocessing mechanism by reducing inter-process communications

* Add flush in order to force print

* Try to prevent deadlocks

* Woops

* Trying to figure out what causes deadlock

* Limit queue size to 1_000_000

* Drastically reduce the maximum number of elements in the queue

* Threading does not use a worker

* Remove shard files and factorise shard naming

* Document high number of worker preprocessing script

* Improve naming

* Update comments and readmes

* Woops

* Remove the notion of vanilla and point to the script instead

* Rephrase readme to use around 60 cores instead of 40

Co-authored-by: Thomas <[email protected]>
* Training groupings

* validation grouping

* steps vs samples

* iteration time (speed -> samples or iterations per second)

* tensorboard group time (from `log_timers_to_tensorboard`)

* comment on the writing condition

* Update megatron/global_vars.py

Co-authored-by: Stas Bekman <[email protected]>

* Update megatron/training.py

Co-authored-by: Stas Bekman <[email protected]>

* Update megatron/training.py

Co-authored-by: Stas Bekman <[email protected]>

* Update megatron/training.py

Co-authored-by: Stas Bekman <[email protected]>

* Update megatron/training.py

Co-authored-by: Stas Bekman <[email protected]>

* link bug fix issue on megatron-lm side

Co-authored-by: Stas Bekman <[email protected]>
* chore: update requirements.txt

* chore: rm deepspeed

README already specifies this in greater detail.
* Update gpt2_tokenization.py

Adding LRU cache and speeding up tokenization.

* Update gpt2_tokenization.py

Removing _old method. Note that the Chinese token processing is optional and not currently used in training.

* Update gpt2_tokenization.py

* Update preprocess_data.py

The path needs to be set before we can find the "megatron" package.

* Update gpt2_tokenization.py

Adding comments about max_token_len_cache

* Update megatron/tokenizer/gpt2_tokenization.py

Co-authored-by: Thomas Wang <[email protected]>

* Update gpt2_tokenization.py

* Update megatron/tokenizer/gpt2_tokenization.py

Co-authored-by: Stas Bekman <[email protected]>

* Update gpt2_tokenization.py

* Update gpt2_tokenization.py

Co-authored-by: Thomas Wang <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Fix markdown formatting
* feat: add glu variant activations

* fix: rm extraneous parentheses

* feat: rm bias to support jit

* fix: replace negative dim with explicit dim

* fix: use `x.ndim` for generic dim handling

* docs: add note on version for posterity

Co-authored-by: Stas Bekman <[email protected]>

* docs: specify jit in `x.ndim` comment

Co-authored-by: Stas Bekman <[email protected]>

* test: add simple tests to check activations

* fix: use `torch.testing` for tensor checks

* test: use seed-controlled random batch inputs

Co-authored-by: Stas Bekman <[email protected]>
* Use new zero.Init() API (#10)

* query deepspeed global grad norm (#8)

Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Shaden Smith <[email protected]>
* indexed_dataset: use numpy to compute byte offsets faster

* preprocess with huggingface datasets and mpi

* preprocess_dataset_mpi: add --shuffle and --seed options

* indexed_dataset: fix to handle file with 0 items

* preprocess_dataset_mpi: add --split and --count options

* update script comments to reflect shuffle behavior

* add torch.distributed version

* Update tools/preprocess_dataset_mpi.py

Co-authored-by: Thomas Wang <[email protected]>

* Update tools/preprocess_dataset_mpi.py

Co-authored-by: Thomas Wang <[email protected]>

* Update tools/preprocess_dataset_mpi.py

Co-authored-by: Thomas Wang <[email protected]>

* Update tools/preprocess_dataset_mpi.py

Co-authored-by: Thomas Wang <[email protected]>

* Update tools/preprocess_dataset_mpi.py

Co-authored-by: Thomas Wang <[email protected]>

* Update tools/preprocess_dataset_mpi.py

Co-authored-by: Thomas Wang <[email protected]>

* Update tools/preprocess_dataset_mpi.py

Co-authored-by: Thomas Wang <[email protected]>

* add estimated progress logging

* avoid downloading dataset unless user really wants to

* Update tools/preprocess_dataset_mpi.py

Co-authored-by: Thomas Wang <[email protected]>

* Update tools/preprocess_dataset_mpi.py

Co-authored-by: Thomas Wang <[email protected]>

* refactor main into more functions

* reformat progress messages

* move mpi4py import test to get_args

* drop Open MPI variables from init_process_group

* add --local_rank to support torch.distributed.launch

* update from DeepSpeedExamples

* raise exceptions on errors

* drop --download option

* format byte rate as MB/s

* Update tools/preprocess_dataset_mpi.py

Co-authored-by: Thomas Wang <[email protected]>

* move datasets import back to top

* import config from datasets

Co-authored-by: Thomas Wang <[email protected]>
* add test suite

* add test suite
…#63)

* shuffle index list with numpy, scatter list, use file for large lists

* drop unused idx_end from index scatter

* drop scatter list file to simplify, can add back if needed

* rework scatterv, recompute num_samples when needed

* Update tools/preprocess_dataset_mpi.py

Co-authored-by: Thomas Wang <[email protected]>

* fix spacing

Co-authored-by: Thomas Wang <[email protected]>
@mayank31398
Collaborator Author

@pai4451 I am open to suggestions if you have any.
I was thinking of running a server for each of the 8 processes; the 0th process would receive a generate request from the user and forward it to the servers of the other processes.
But I am not sure how much overhead this would cause.

@pai4451

pai4451 commented Aug 5, 2022

@mayank31398 Thanks, I’m also working on serving BLOOM with DeepSpeed. I think this solution might work, but in terms of serving we have to consider the maintenance cost. The difficult part I think is to keep all processes stable (alive and synchronized).

@mayank31398
Collaborator Author

@pai4451 can you give this latest code a try?
I am able to run the server, but the code gets stuck on model.generate() and I don't really understand why.

@mayank31398
Collaborator Author

mayank31398 commented Aug 6, 2022

What the current code does:
It creates 8 servers: 1 on the main HOST:PORT and the other 7 on 127.0.0.1:PORT+1, 127.0.0.1:PORT+2, ...
The main server sends calls to the other 7 servers to run the generate method.
I see that the code just gets stuck at line 165 after the first request is sent.

The code is working up to line 164 though (when a request is sent): I see the tokenized input on all 8 processes.

@pai4451

@pai4451

pai4451 commented Aug 7, 2022

@pai4451 can you give this latest code a try? I am able to run the server, but the code gets stuck on model.generate() and I don't really understand why.

@mayank31398 I also get stuck on the line model.generate(). Maybe some processes failed to communicate with the others, or the processes are not synchronized? I suspect that the way the server is launched via deepspeed might be causing inter-process communication problems.
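A minimal sketch of the synchronization idea being discussed here, assuming a torch.distributed process group is already initialized and each rank holds its tensor-parallel shard; the helper name is invented. With tensor parallelism, generate() runs collectives internally, so every rank must call it with the same inputs; if only rank 0 reaches it, it blocks exactly as described above.

```python
# Hedged sketch, not the PR's code: rank 0 owns the incoming request and
# broadcasts it so that every rank reaches model.generate() together.
import torch
import torch.distributed as dist


def generate_on_all_ranks(model, tokenizer, prompt=None, max_new_tokens=50):
    # only rank 0 has a real prompt; the other ranks pass a placeholder
    payload = [prompt if dist.get_rank() == 0 else None]
    dist.broadcast_object_list(payload, src=0)  # now every rank holds the prompt

    inputs = tokenizer(payload[0], return_tensors="pt").to(torch.cuda.current_device())
    # collective call: all tensor-parallel ranks must execute this together,
    # otherwise the collectives inside the forward pass hang
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)

    if dist.get_rank() == 0:
        return tokenizer.decode(output[0], skip_special_tokens=True)
    return None
```

In this pattern the non-zero ranks sit in a loop calling the same function with prompt=None, so the only per-request traffic is the broadcast of the prompt from rank 0.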

@mayank31398
Collaborator Author

@pai4451 the DS inference server is working now.
I have deployed it using DeepSpeed MII, a new library just released by the DeepSpeed team. ❤️

You can use the scripts now.
Instructions are in the README.
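For reference, a minimal MII sketch (not the script in this PR); the deployment name, mii_config keys, and query kwargs are assumptions based on MII's documentation rather than this code.

```python
# Hedged sketch of a DeepSpeed MII deployment; check the README in this PR
# and the MII docs for the real configuration.
import mii

mii.deploy(
    task="text-generation",
    model="bigscience/bloom",
    deployment_name="bloom_generation",                   # assumed name
    mii_config={"dtype": "fp16", "tensor_parallel": 8},   # assumed config keys
)

# client side: query the persistent server started by deploy()
generator = mii.mii_query_handle("bloom_generation")
result = generator.query({"query": ["DeepSpeed MII is"]}, max_new_tokens=64)
print(result)
```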

@mayank31398
Collaborator Author

@stas00, I would like to contribute this to the bloom-inference branch, if that's all right.
Currently, the scripts only work with batch size = 1.
2 scripts have been added (with a little code refactoring of the other scripts).
I am working on increasing the batch size 🤗 now.

@pai4451

pai4451 commented Aug 7, 2022

@pai4451 the DS inference server is working now. I have deployed it using DeepSpeed MII, a new library just released by the DeepSpeed team. ❤️

You can use the scripts now. Instructions are in the README.

Do you think the code in bloom-ds-server.py can be run on two nodes? With my current hardware limits, I have to use two nodes to accommodate the entire BLOOM model.

@mayank31398
Collaborator Author

I am not sure; I have only tested on 1 node with 8 x 80GB A100 GPUs. Even if you can run it on 2 nodes, the original Megatron-LM paper doesn't recommend spanning tensor parallelism across nodes, as it drastically reduces performance.

@mayank31398
Collaborator Author

I screwed up this PR
❤️
@pai4451

@mayank31398 mayank31398 closed this Aug 8, 2022
@mayank31398 mayank31398 deleted the generation-server branch August 8, 2022 17:31
@mayank31398
Collaborator Author

Moving to #325

@stas00
Contributor

stas00 commented Aug 8, 2022

@stas00, I would like to contribute this to the bloom-inference branch, if that's all right.

Well, ideally all of this should go directly to https://github.com/huggingface/transformers/tree/main/examples/research_projects/bloom-inference (the last path component doesn't exist yet),

so the bloom-inference branch here should be moved there as well.

Does your code depend on the script under the bloom-inference branch? If not, perhaps open a separate PR into transformers and tag me on it?

At some point I will be doing the same for the bloom-inference branch.

@mayank31398
Collaborator Author

Well, no, @stas00, but it has a lot of duplicate code for now. That's why reusing the same methods across scripts would be better.
Also, I am not able to use DS inference with batch size > 1; I still get an illegal memory access. After the DeepSpeed fix, batch size = 1 started working.

Is it possible this is caused by the CUDA version I am using (11.6)?
What CUDA environment are you using?
Also, is PyTorch built from source, and which version?

@mayank31398
Collaborator Author

Also, the memory leak in HF accelerate is not seen by @sgugger, so I am not sure why it is happening in my environment.

@stas00
Contributor

stas00 commented Aug 9, 2022

I suppose we could start turning the scripts into small libraries that the scripts would pull in.

Would it help if I merged the bloom-inference branch, you rebased on it, and then started converting the scripts into libs and reusing the code?
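A hypothetical sketch of that layout (file and function names are invented for illustration): a shared module holds the loading helpers and each server script imports it.

```python
# utils.py — hypothetical shared library pulled in by both server scripts
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def get_tokenizer(model_name):
    return AutoTokenizer.from_pretrained(model_name)


def get_hf_accelerate_model(model_name, dtype=torch.bfloat16):
    # accelerate path shared by the scripts; the DS-inference loader would live here too
    return AutoModelForCausalLM.from_pretrained(
        model_name, device_map="auto", torch_dtype=dtype
    )

# each server script (e.g. bloom-ds-server.py) would then only do
#   from utils import get_tokenizer, get_hf_accelerate_model
# and keep just the serving logic, instead of duplicating the loading code.
```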
