Generation server using HF accelerate and DS inference #321
Conversation
This reverts commit a40d816.
Co-authored-by: Jeff Rasley <[email protected]>
* Propose a faster preprocessing mechanism by reducing inter-process communication
* Add flush in order to force print
* Try to prevent deadlocks
* Woops
* Trying to figure out what causes the deadlock
* Limit queue size to 1_000_000
* Drastically reduce the maximum number of elements in the queue
* Threading does not use a worker
* Remove shard files and factorise shard naming
* Document the high-worker-count preprocessing script
* Improve naming
* Update comments and readmes
* Woops
* Remove the notion of vanilla and point to the script instead
* Rephrase readme to use around 60 cores instead of 40

Co-authored-by: Thomas <ö[email protected]>
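For illustration only: the commit above caps queue sizes to avoid deadlocks during preprocessing. Below is a minimal sketch of that bounded-queue pattern, not the actual preprocessing script; `encode_line`, `QUEUE_MAX`, and the worker layout are hypothetical stand-ins.

```python
# Sketch of a bounded-queue preprocessing pipeline: a small maxsize applies
# back-pressure so producers cannot outrun the consumer and exhaust memory.
import multiprocessing as mp
import threading

QUEUE_MAX = 10_000  # assumed cap; the point is to bound the queue rather than let it grow

def encode_line(line):
    return [len(word) for word in line.split()]  # placeholder for real tokenization

def worker(in_q, out_q):
    for line in iter(in_q.get, None):  # run until the None sentinel arrives
        out_q.put(encode_line(line))
    out_q.put(None)                    # signal that this worker has finished

def preprocess(lines, num_workers=4):
    in_q = mp.Queue(maxsize=QUEUE_MAX)
    out_q = mp.Queue(maxsize=QUEUE_MAX)
    workers = [mp.Process(target=worker, args=(in_q, out_q)) for _ in range(num_workers)]
    for p in workers:
        p.start()

    def feed():
        for line in lines:
            in_q.put(line)             # blocks when the queue is full (back-pressure)
        for _ in workers:
            in_q.put(None)

    threading.Thread(target=feed, daemon=True).start()

    results, done = [], 0
    while done < num_workers:          # drain until every worker has signalled completion
        item = out_q.get()
        if item is None:
            done += 1
        else:
            results.append(item)
    for p in workers:
        p.join()
    return results

if __name__ == "__main__":
    print(preprocess(["hello world", "bounded queues avoid deadlocks"], num_workers=2))
```

Feeding from a separate thread while the main thread drains the output queue keeps both queues moving, which is one way to avoid the deadlocks mentioned in the commit.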
* Training groupings
* validation grouping
* steps vs samples
* iteration time (speed -> samples or iterations per second)
* tensorboard group time (from `log_timers_to_tensorboard`)
* comment on the writing condition
* Update megatron/global_vars.py
* Update megatron/training.py
* Update megatron/training.py
* Update megatron/training.py
* Update megatron/training.py
* link bug fix issue on megatron-lm side

Co-authored-by: Stas Bekman <[email protected]>
* chore: update requirements.txt
* chore: rm deepspeed; README already specifies this in greater detail
* Update gpt2_tokenization.py: add an LRU cache and speed up tokenization
* Update gpt2_tokenization.py: remove the _old method; note that the Chinese token processing is optional and not currently used in training
* Update gpt2_tokenization.py
* Update preprocess_data.py: the path needs to be set before we can find the "megatron" package
* Update gpt2_tokenization.py: add comments about max_token_len_cache
* Update megatron/tokenizer/gpt2_tokenization.py
* Update gpt2_tokenization.py
* Update megatron/tokenizer/gpt2_tokenization.py
* Update gpt2_tokenization.py
* Update gpt2_tokenization.py

Co-authored-by: Thomas Wang <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
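To illustrate the LRU-cache idea in the commit above, here is a minimal sketch; `bpe()` is a trivial placeholder for the real byte-pair-encoding merge loop, and `MAX_TOKEN_LEN_CACHE` mirrors the max_token_len_cache idea with an assumed value.

```python
# Sketch of caching the per-token BPE step, which is the hot path when the
# same words recur across a corpus. Not the actual gpt2_tokenization.py code.
from functools import lru_cache

MAX_TOKEN_LEN_CACHE = 128          # assumed threshold; very long tokens bypass the cache

def bpe(token: str) -> str:
    # placeholder for the expensive byte-pair-encoding merge loop
    return " ".join(token)

@lru_cache(maxsize=1_000_000)
def cached_bpe(token: str) -> str:
    return bpe(token)

def encode_token(token: str) -> str:
    # rare, very long tokens would evict useful cache entries, so skip the cache for them
    if len(token) > MAX_TOKEN_LEN_CACHE:
        return bpe(token)
    return cached_bpe(token)

print(encode_token("hello"))       # 'h e l l o' with the placeholder bpe()
```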
Fix markdown formatting
…e of lr-decay-style
* feat: add glu variant activations
* fix: rm extraneous parentheses
* feat: rm bias to support jit
* fix: replace negative dim with explicit dim
* fix: use `x.ndim` for generic dim handling
* docs: add note on version for posterity
* docs: specify jit in `x.ndim` comment
* test: add simple tests to check activations
* fix: use `torch.testing` for tensor checks
* test: use seed-controlled random batch inputs

Co-authored-by: Stas Bekman <[email protected]>
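As a hedged illustration of a GLU-variant activation of the kind this commit adds (the exact functions in the repo may differ), written in a JIT-friendly style with an explicit `x.ndim`-based dim:

```python
# Sketch of a GEGLU-style gated activation: one half of the last dimension
# gates the GELU of the other half. Illustrative only.
import torch
import torch.nn.functional as F

@torch.jit.script
def geglu(x: torch.Tensor) -> torch.Tensor:
    # explicit x.ndim - 1 instead of dim=-1, which older TorchScript versions dislike
    halves = x.chunk(2, dim=x.ndim - 1)
    return halves[0] * F.gelu(halves[1])

x = torch.randn(4, 8)          # last dimension must be even
print(geglu(x).shape)          # torch.Size([4, 4])
```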
* Use new zero.Init() API (#10)
* query deepspeed global grad norm (#8)

Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Shaden Smith <[email protected]>
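A sketch of the `zero.Init()` usage referenced above, assuming DeepSpeed is installed and the script runs under a distributed launcher (e.g. `deepspeed`); the toy model is purely illustrative.

```python
# Constructing a model inside deepspeed.zero.Init() lets ZeRO-3 partition the
# parameters at creation time, so no single rank has to materialise the full model.
import torch
import deepspeed

# Note: this context expects a distributed environment (launch with the
# deepspeed launcher or torchrun); run standalone it may try to initialise one.
with deepspeed.zero.Init():
    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 4096),
        torch.nn.GELU(),
        torch.nn.Linear(4096, 1024),
    )
# parameters are now ZeRO-partitioned placeholders until gathered or wrapped by the engine
```

For the second item, recent DeepSpeed engines expose the global gradient norm via an engine-side accessor (e.g. `get_global_grad_norm()`), though the exact query used by this repo may differ.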
* indexed_dataset: use numpy to compute byte offsets faster
* preprocess with huggingface datasets and mpi
* preprocess_dataset_mpi: add --shuffle and --seed options
* indexed_dataset: fix to handle file with 0 items
* preprocess_dataset_mpi: add --split and --count options
* update script comments to reflect shuffle behavior
* add torch.distributed version
* Update tools/preprocess_dataset_mpi.py (review suggestions)
* add estimated progress logging
* avoid downloading dataset unless user really wants to
* Update tools/preprocess_dataset_mpi.py (review suggestions)
* refactor main into more functions
* reformat progress messages
* move mpi4py import test to get_args
* drop Open MPI variables from init_process_group
* add --local_rank to support torch.distributed.launch
* update from DeepSpeedExamples
* raise exceptions on errors
* drop --download option
* format byte rate as MB/s
* Update tools/preprocess_dataset_mpi.py (review suggestions)
* move datasets import back to top
* import config from datasets

Co-authored-by: Thomas Wang <[email protected]>
* add test suite
* add test suite
…#63)
* shuffle index list with numpy, scatter list, use file for large lists
* drop unused idx_end from index scatter
* drop scatter list file to simplify, can add back if needed
* rework scatterv, recompute num_samples when needed
* Update tools/preprocess_dataset_mpi.py
* fix spacing

Co-authored-by: Thomas Wang <[email protected]>
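A minimal sketch of the shuffle-and-Scatterv idea described in this commit, assuming mpi4py is available; variable names are illustrative and the real tools/preprocess_dataset_mpi.py differs.

```python
# Rank 0 shuffles the sample indices with numpy, then Scatterv hands each rank
# an (almost) even contiguous slice of the shuffled list.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

num_samples = 1000                      # assumed total number of samples
if rank == 0:
    idx = np.arange(num_samples, dtype=np.int64)
    np.random.default_rng(seed=42).shuffle(idx)   # seed-controlled shuffle
else:
    idx = None

# per-rank counts and displacements for an almost-even split
counts = np.full(size, num_samples // size, dtype=np.int64)
counts[: num_samples % size] += 1
displs = np.zeros(size, dtype=np.int64)
displs[1:] = np.cumsum(counts)[:-1]

local = np.empty(counts[rank], dtype=np.int64)
if rank == 0:
    comm.Scatterv([idx, counts, displs, MPI.INT64_T], local, root=0)
else:
    comm.Scatterv(None, local, root=0)

print(f"rank {rank} received {local.size} shuffled indices")
```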
@pai4451 I am open to suggestions if you have any.
@mayank31398 Thanks, I’m also working on serving BLOOM with DeepSpeed. I think this solution might work, but in terms of serving we have to consider the maintenance cost. The difficult part, I think, is keeping all processes stable (alive and synchronized).
@pai4451 can you give this latest code a try?
What the current code is doing: the code works up to line 164 (when a request is sent). I see tokenized input on all 8 processes.
@mayank31398 I also get stuck on the model.generate() line. Maybe some processes failed to communicate with the others, or the processes are not synchronized? I doubt the way to launch the server via
@pai4451 The DS-inference server is working now. You can use the scripts.
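For readers hitting the same hang: below is a hedged sketch (not the scripts in this PR) of the synchronization pattern a multi-rank generation server needs, where rank 0 owns the incoming request and every rank must reach model.generate() together. `model` and `tokenizer` are assumed to be already loaded on each rank, e.g. via deepspeed.init_inference; the HTTP layer is omitted.

```python
# If only rank 0 enters generation, the tensor-parallel collectives inside the
# model deadlock; broadcasting the prompt first keeps all ranks in lockstep.
import torch
import torch.distributed as dist

def serve_request(prompt_on_rank0, model, tokenizer):
    rank = dist.get_rank()
    # rank 0 receives the request; the other ranks wait for the broadcast
    payload = [prompt_on_rank0 if rank == 0 else None]
    dist.broadcast_object_list(payload, src=0)
    prompt = payload[0]

    inputs = tokenizer(prompt, return_tensors="pt").to(torch.cuda.current_device())
    with torch.no_grad():
        # every rank calls generate(); this is the line that hangs if ranks diverge
        output = model.generate(**inputs, max_new_tokens=50)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```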
@stas00, I would like to contribute this to the bloom-inference branch, if that's all right.
Do you think the code
I am not sure. I have tested on 1 node with 8 x 80GB A100 GPUs. Even if you can run it on 2 nodes, the original Megatron-LM paper doesn't recommend spanning tensor parallelism across nodes.
I screwed up this PR
Moving to #325
Well, ideally all this should go directly to https://github.com/huggingface/transformers/tree/main/examples/research_projects/bloom-inference (the last section doesn't exist yet), so the bloom-inference branch here should be moved there as well. Does your code depend on the script under the bloom-inference branch? If not, perhaps open a separate PR into transformers and tag me on it? At some point I will be doing the same for the bloom-inference branch.
Well, no @stas00, but it has a lot of duplicate code for now. That's why re-using the same methods across scripts would be better. Is it possible this is caused by CUDA version 11.6 (which I am using)?
Also, the memory leak in HF accelerate is not seen by @sgugger, so I am not sure why it is happening in my environment.
I suppose we could start turning the scripts into small libraries that the scripts would pull in. Would it help if I merged the bloom-inference branch, you rebased on it, and then started converting the scripts into libs and re-using the code?
This PR depends on
There are some redundant methods in some scripts that can be removed once #308 is merged into the main branch.
This PR adds scripts for creating a generation server using both HF accelerate and DeepSpeed inference.