
Generation server using HF accelerate and DS inference #321

Closed
wants to merge 1,165 commits into from

Conversation

mayank31398
Collaborator

This PR depends on
There are some redundant methods in some scripts that can be removed once #308 is merged into the main branch.
This PR adds scripts for creating a generation server using both HF accelerate and DeepSpeed inference.
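For context, a minimal sketch of the two loading paths such a server would wrap. This is not the PR's actual code; the model name, dtype, and GPU count are assumptions for illustration.

```python
# Hedged sketch of the two back ends this PR targets, not the PR's own scripts.
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "bigscience/bloom"  # assumed checkpoint


def load_with_hf_accelerate():
    # accelerate path: device_map="auto" shards the checkpoint across visible GPUs
    return AutoModelForCausalLM.from_pretrained(
        MODEL_NAME, device_map="auto", torch_dtype=torch.bfloat16
    )


def load_with_ds_inference():
    # DeepSpeed inference path: wrap the HF model with fused kernels and
    # tensor parallelism; run under the `deepspeed` launcher with one process per GPU
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16)
    engine = deepspeed.init_inference(
        model,
        mp_size=8,                      # assumed: 8 GPUs / processes
        dtype=torch.float16,
        replace_with_kernel_inject=True,
    )
    return engine.module


def generate_text(model, tokenizer, prompt, max_new_tokens=50):
    inputs = tokenizer(prompt, return_tensors="pt").to(torch.cuda.current_device())
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0], skip_special_tokens=True)


# usage sketch:
# tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# model = load_with_hf_accelerate()   # or load_with_ds_inference()
# print(generate_text(model, tokenizer, "Hello, my name is"))
```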

stas00 and others added 30 commits July 28, 2021 10:02
* Propose a faster preprocessing mechanism by reducing inter-process communications

* Add flush in order to force print

* Try to prevent deadlocks

* Woops

* Trying to figure out what causes deadlock

* Limit queue size to 1_000_000

* Drastically reduce the maximum number of elements in the queue

* Threading does not use a worker

* Remove shard files and factorise shard naming

* Document high number of worker preprocessing script

* Improve naming

* Update comments and readmes

* Woops

* Remove the notion of vanilla and point to the script instead

* Rephrase readme to use around 60 cores instead of 40

Co-authored-by: Thomas <[email protected]>
* Training groupings

* validation grouping

* steps vs samples

* iteration time (speed -> samples or iterations per second)

* tensorboard group time (from `log_timers_to_tensorboard`)

* comment on the writing condition

* Update megatron/global_vars.py

Co-authored-by: Stas Bekman <[email protected]>

* Update megatron/training.py

Co-authored-by: Stas Bekman <[email protected]>

* Update megatron/training.py

Co-authored-by: Stas Bekman <[email protected]>

* Update megatron/training.py

Co-authored-by: Stas Bekman <[email protected]>

* Update megatron/training.py

Co-authored-by: Stas Bekman <[email protected]>

* link bug fix issue on megatron-lm side

Co-authored-by: Stas Bekman <[email protected]>
* chore: update requirements.txt

* chore: rm deepspeed

README already specifies this in greater detail.
* Update gpt2_tokenization.py

Adding LRU cache and speeding up tokenization.

* Update gpt2_tokenization.py

Removing _old method. Note that the Chinese token processing is optional and not currently used in training.

* Update gpt2_tokenization.py

* Update preprocess_data.py

The path needs to be set before we can find the "megatron" package.

* Update gpt2_tokenization.py

Adding comments about max_token_len_cache

* Update megatron/tokenizer/gpt2_tokenization.py

Co-authored-by: Thomas Wang <[email protected]>

* Update gpt2_tokenization.py

* Update megatron/tokenizer/gpt2_tokenization.py

Co-authored-by: Stas Bekman <[email protected]>

* Update gpt2_tokenization.py

* Update gpt2_tokenization.py

Co-authored-by: Thomas Wang <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Fix markdown formatting
* feat: add glu variant activations

* fix: rm extraneous parentheses

* feat: rm bias to support jit

* fix: replace negative dim with explicit dim

* fix: use `x.ndim` for generic dim handling

* docs: add note on version for posterity

Co-authored-by: Stas Bekman <[email protected]>

* docs: specify jit in `x.ndim` comment

Co-authored-by: Stas Bekman <[email protected]>

* test: add simple tests to check activations

* fix: use `torch.testing` for tensor checks

* test: use seed-controlled random batch inputs

Co-authored-by: Stas Bekman <[email protected]>
* Use new zero.Init() API (#10)

* query deepspeed global grad norm (#8)

Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Shaden Smith <[email protected]>
* indexed_dataset: use numpy to compute byte offsets faster

* preprocess with huggingface datasets and mpi

* preprocess_dataset_mpi: add --shuffle and --seed options

* indexed_dataset: fix to handle file with 0 items

* preprocess_dataset_mpi: add --split and --count options

* update script comments to reflect shuffle behavior

* add torch.distributed version

* Update tools/preprocess_dataset_mpi.py

Co-authored-by: Thomas Wang <[email protected]>

* Update tools/preprocess_dataset_mpi.py

Co-authored-by: Thomas Wang <[email protected]>

* Update tools/preprocess_dataset_mpi.py

Co-authored-by: Thomas Wang <[email protected]>

* Update tools/preprocess_dataset_mpi.py

Co-authored-by: Thomas Wang <[email protected]>

* Update tools/preprocess_dataset_mpi.py

Co-authored-by: Thomas Wang <[email protected]>

* Update tools/preprocess_dataset_mpi.py

Co-authored-by: Thomas Wang <[email protected]>

* Update tools/preprocess_dataset_mpi.py

Co-authored-by: Thomas Wang <[email protected]>

* add estimated progress logging

* avoid downloading dataset unless user really wants to

* Update tools/preprocess_dataset_mpi.py

Co-authored-by: Thomas Wang <[email protected]>

* Update tools/preprocess_dataset_mpi.py

Co-authored-by: Thomas Wang <[email protected]>

* refactor main into more functions

* reformat progress messages

* move mpi4py import test to get_args

* drop Open MPI variables from init_process_group

* add --local_rank to support torch.distributed.launch

* update from DeepSpeedExamples

* raise exceptions on errors

* drop --download option

* format byte rate as MB/s

* Update tools/preprocess_dataset_mpi.py

Co-authored-by: Thomas Wang <[email protected]>

* move datasets import back to top

* import config from datasets

Co-authored-by: Thomas Wang <[email protected]>
* add test suite

* add test suite
…#63)

* shuffle index list with numpy, scatter list, use file for large lists

* drop unused idx_end from index scatter

* drop scatter list file to simplify, can add back if needed

* rework scatterv, recompute num_samples when needed

* Update tools/preprocess_dataset_mpi.py

Co-authored-by: Thomas Wang <[email protected]>

* fix spacing

Co-authored-by: Thomas Wang <[email protected]>
@mayank31398
Collaborator Author

@pai4451 I am open to suggestions if you have any.
I was thinking of running a server for each of the 8 processes; the 0th process would receive a generate request from the user and forward it to the servers of the other processes.
But I am not sure how much overhead this would cause.

@pai4451

pai4451 commented Aug 5, 2022

@mayank31398 Thanks, I’m also working on serving BLOOM with DeepSpeed. I think this solution might work, but in terms of serving we have to consider the maintenance cost. The difficult part I think is to keep all processes stable (alive and synchronized).

@mayank31398
Collaborator Author

@pai4451 can you give this latest code a try?
I am able to run the server, but the code gets stuck on model.generate() and I don't really understand why.

@mayank31398
Collaborator Author

mayank31398 commented Aug 6, 2022

What the current code does:
It creates 8 servers: 1 on the main HOST:PORT and the other 7 on 127.0.0.1:PORT+1, 127.0.0.1:PORT+2, ...
The main server sends calls to the other 7 servers to run the generate method.
I see that the code just gets stuck at line 165 after the first request is sent.

The code is working up to line 164 though (when a request is sent): I see the tokenized input on all 8 processes.

@pai4451

@pai4451

pai4451 commented Aug 7, 2022

@pai4451 can you give this latest code a try? I am able to run the server, but the code gets stuck on model.generate() and I don't really understand why.

@mayank31398 I also get stuck on the line model.generate(). Maybe some processes failed to communicate with the others, or the processes are not synchronized? I suspect that the way the server is launched via deepspeed might be causing inter-process communication problems.
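A minimal sketch of the synchronization idea being discussed here, assuming a torch.distributed process group is already initialized and each rank holds its tensor-parallel shard; the helper name is invented. With tensor parallelism, generate() runs collectives internally, so every rank must call it with the same inputs; if only rank 0 reaches it, it blocks exactly as described above.

```python
# Hedged sketch, not the PR's code: rank 0 owns the incoming request and
# broadcasts it so that every rank reaches model.generate() together.
import torch
import torch.distributed as dist


def generate_on_all_ranks(model, tokenizer, prompt=None, max_new_tokens=50):
    # only rank 0 has a real prompt; the other ranks pass a placeholder
    payload = [prompt if dist.get_rank() == 0 else None]
    dist.broadcast_object_list(payload, src=0)  # now every rank holds the prompt

    inputs = tokenizer(payload[0], return_tensors="pt").to(torch.cuda.current_device())
    # collective call: all tensor-parallel ranks must execute this together,
    # otherwise the collectives inside the forward pass hang
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)

    if dist.get_rank() == 0:
        return tokenizer.decode(output[0], skip_special_tokens=True)
    return None
```

In this pattern the non-zero ranks sit in a loop calling the same function with prompt=None, so the only per-request traffic is the broadcast of the prompt from rank 0.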

@mayank31398
Collaborator Author

@pai4451 the DS inference server is working now.
I have deployed it using DeepSpeed MII, a new library just released by the DeepSpeed team. ❤️

You can use the scripts now.
Instructions are in the README.
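For reference, a minimal MII sketch (not the script in this PR); the deployment name, mii_config keys, and query kwargs are assumptions based on MII's documentation rather than this code.

```python
# Hedged sketch of a DeepSpeed MII deployment; check the README in this PR
# and the MII docs for the real configuration.
import mii

mii.deploy(
    task="text-generation",
    model="bigscience/bloom",
    deployment_name="bloom_generation",                   # assumed name
    mii_config={"dtype": "fp16", "tensor_parallel": 8},   # assumed config keys
)

# client side: query the persistent server started by deploy()
generator = mii.mii_query_handle("bloom_generation")
result = generator.query({"query": ["DeepSpeed MII is"]}, max_new_tokens=64)
print(result)
```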

@mayank31398
Collaborator Author

@stas00, I would like to contribute this to the bloom-inference branch, if that's all right.
Currently, the scripts only work with batch size = 1.
2 scripts have been added (with a little code refactoring of the other scripts).
I am working on increasing the batch size 🤗 now.

@pai4451

pai4451 commented Aug 7, 2022

@pai4451 the DS inference server is working now. I have deployed it using DeepSpeed MII, a new library just released by the DeepSpeed team. ❤️

You can use the scripts now. Instructions are in the README.

Do you think the code in bloom-ds-server.py can be run on two nodes? With my current hardware limits, I have to use two nodes to accommodate the entire BLOOM model.

@mayank31398
Collaborator Author

I am not sure; I have only tested on 1 node with 8 x 80GB A100 GPUs. Even if you can run it on 2 nodes, the original Megatron-LM paper doesn't recommend spanning tensor parallelism across nodes, as it drastically reduces performance.

@mayank31398
Collaborator Author

I screwed up this PR
❤️
@pai4451

@mayank31398 mayank31398 closed this Aug 8, 2022
@mayank31398 mayank31398 deleted the generation-server branch August 8, 2022 17:31
@mayank31398
Collaborator Author

Moving to #325

@stas00
Contributor

stas00 commented Aug 8, 2022

@stas00, I would like to contribute this to the bloom-inference branch, if that's all right.

Well, ideally all of this should go directly to https://github.com/huggingface/transformers/tree/main/examples/research_projects/bloom-inference (the last path component doesn't exist yet),

so the bloom-inference branch here should be moved there as well.

Does your code depend on the script under the bloom-inference branch? If not, perhaps open a separate PR into transformers and tag me on it?

At some point I will be doing the same for the bloom-inference branch.

@mayank31398
Collaborator Author

Well, no, @stas00, but it has a lot of duplicate code for now. That's why reusing the same methods across scripts would be better.
Also, I am not able to use DS inference with batch size > 1; I still get an illegal memory access. After the DeepSpeed fix, batch size = 1 started working.

Is it possible this is caused by the CUDA version I am using (11.6)?
What CUDA environment are you using?
Also, is PyTorch built from source, and which version?

@mayank31398
Collaborator Author

Also, the memory leak in HF accelerate is not seen by @sgugger, so I am not sure why it is happening in my environment.

@stas00
Contributor

stas00 commented Aug 9, 2022

I suppose we could start turning the scripts into small libraries that the scripts would pull in.

Would it help if I merged the bloom-inference branch, you rebased on it, and then started converting the scripts into libs and reusing the code?
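A hypothetical sketch of that layout (file and function names are invented for illustration): a shared module holds the loading helpers and each server script imports it.

```python
# utils.py — hypothetical shared library pulled in by both server scripts
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def get_tokenizer(model_name):
    return AutoTokenizer.from_pretrained(model_name)


def get_hf_accelerate_model(model_name, dtype=torch.bfloat16):
    # accelerate path shared by the scripts; the DS-inference loader would live here too
    return AutoModelForCausalLM.from_pretrained(
        model_name, device_map="auto", torch_dtype=dtype
    )

# each server script (e.g. bloom-ds-server.py) would then only do
#   from utils import get_tokenizer, get_hf_accelerate_model
# and keep just the serving logic, instead of duplicating the loading code.
```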
