BLOOM Inference via DeepSpeed-Inference, Accelerate and DeepSpeed-ZeRO #308
Conversation
With a single A100 it fails with:
It is very weird, let me try it on my side and I will push a fix soon.
I used my own script and could run this fine; I am gonna try it with yours and see if it works.
I also produced good results with your script on my side:
I found an issue with your script when running with multi-GPU: it results in illegal memory access. I'll push a fix now.
So, I realize I cannot push here. But the change is simple and you can do it on your side. Just change the `mp_size` from 1 to `world_size` when passing it to `init_inference`:
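For reference, the suggested change looks roughly like this (a minimal sketch; the model name and the surrounding setup are illustrative, not the exact code of the script in this PR):

```python
import os

import deepspeed
import torch
from transformers import AutoModelForCausalLM

# the launcher (deepspeed / torchrun) sets WORLD_SIZE for each process
world_size = int(os.getenv("WORLD_SIZE", "1"))

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-350m", torch_dtype=torch.half
)

# mp_size must match the number of processes the script was launched with;
# leaving it hardcoded to 1 is what triggers the illegal memory access on multi-GPU
model = deepspeed.init_inference(
    model,
    mp_size=world_size,               # was: mp_size=1
    dtype=torch.half,
    replace_with_kernel_inject=True,
)
```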
Btw, is there any chance you could run this on 2 GPUs? Can you please retry this using
I sent you an invite to this repo yesterday, please check your emails from GitHub.

The problem was that I didn't build a kernel for this card; I was able to see that through adding
Fixed that now and passed that point (and yes, `mp_size` should be fixed as well, but it was fine for 1 GPU) - thank you! Next, it's crashing here:
Hi @stas00, can you please see if you can run this without kernel injection? Just remove the
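If I read this right, that amounts to turning off DeepSpeed's kernel injection in the `init_inference` call; roughly (a sketch, and the argument name is my assumption of what is being removed):

```python
# Sketch: same init_inference call, but without the fused inference kernels,
# so generation falls back to the stock HF/PyTorch ops - slower, but useful
# for checking whether a crash comes from the injected kernels.
model = deepspeed.init_inference(
    model,
    mp_size=world_size,
    dtype=torch.half,
    replace_with_kernel_inject=False,  # assumed: the argument being dropped/disabled
)
```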
Reza figured it out - it was a bug in
Status update - after many days and nights of hard work, everything works fast and great! Reza++! Let's generate some text:
in=DeepSpeed is a machine learning framework
out=DeepSpeed is a machine learning framework that is designed to be used by researchers and developers who are interested in applying deep learning to their own problems. It is a Python library that provides a set of tools for training and evaluating deep neural networks. It is designed to be easy to use and to provide a flexible environment for experimentation. DeepSpeed is built on top of the Caffe deep learning framework, and it provides a set of tools for training and evaluating deep neural networks. It is designed to
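For context, the in/out pair above comes from an ordinary `generate` call along these lines (a sketch with illustrative settings, not the script's exact arguments):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")

# `model` is the BLOOM causal LM (or the `.module` of the DeepSpeed engine);
# in the multi-GPU script the target device comes from the local rank.
inputs = tokenizer(
    "DeepSpeed is a machine learning framework", return_tensors="pt"
).to("cuda:0")
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```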
OK, I instrumented the script to run various benchmarks, on 8x80GB A100s:
While processing, memory per process:
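For reference, the instrumentation behind numbers like these can be sketched as follows (illustrative only, not the exact benchmark code added to the script):

```python
import time

import torch

def benchmark(model, inputs, new_tokens=100):
    # warmup so CUDA kernels and the allocator are primed before timing
    model.generate(**inputs, max_new_tokens=5)
    torch.cuda.synchronize()

    t0 = time.time()
    model.generate(**inputs, max_new_tokens=new_tokens)
    torch.cuda.synchronize()
    elapsed = time.time() - t0

    msecs_per_token = elapsed / new_tokens * 1000
    peak_mem_gb = torch.cuda.max_memory_allocated() / 2**30  # per process
    return msecs_per_token, peak_mem_gb
```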
Radical! @RezaYazdaniAminabadi ran the same on 2x8x40GB A100s and it was 50 msecs per token. The slowness is due to internode communication, so the slowdown will depend on the internode connectivity - a faster network will lead to faster throughput.
@RezaYazdaniAminabadi
Thank you for the full traceback, @xuyifanbupt - as we have 3 different unrelated implementations in this PR, could you please create a new issue for the one you reported, since it's with Accelerate and most of this thread is debugging the ds-inference one, and tag @stas00 and @sgugger on it. Please repaste your script, the full traceback and the rest of the notes you shared. Thank you!
Hi @mayank31398, you should be using the master branch of both DeepSpeed and HuggingFace. Just note that with
Thanks,
I am still getting this issue even when I have installed DeepSpeed and HF from the master branch. I tried with NCCL_DEBUG=INFO and I see this just before the error traceback:

@RezaYazdaniAminabadi yes, this is with batch size 1
Also @RezaYazdaniAminabadi, can you point out the branch which might contain the fix for this?
The bloom-fix branch in DeepSpeed has been merged into master, and `bloom-ds-inference.py` is working fine now.
Thanks for this PR.
Can we merge this?
I don't think this is going to stay here, since it has nothing to do with Meg-DS. I was just using it as a dev PR so that the DeepSpeed team could push directly into it. Once things are stable (and there are still some issues to resolve on the server side), we will merge this into
There is already an inferencing script in the main branch under the same directory. Not sure if it works.
@stas00, I have currently deployed BLOOM in server mode using accelerate with batch size = 1. The scripts can be found here: #325
Perhaps there is a memory leak somewhere? Let's ask @sgugger, as he developed
Not at all, you don't want to do that in production most of the time, unless there is a special situation where you want to control memory freeing. I was just using it to see how much real memory was used, since pytorch tends to cache memory. so
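For illustration, a minimal sketch of that kind of check (not necessarily the exact calls used in the script):

```python
import torch

# pytorch keeps freed blocks in its caching allocator, so nvidia-smi numbers
# overstate what the model actually uses; releasing the cache makes the
# reported values reflect real usage.
torch.cuda.empty_cache()
used_gb = torch.cuda.memory_allocated() / 2**30
peak_gb = torch.cuda.max_memory_allocated() / 2**30
print(f"allocated: {used_gb:.1f}GB, peak: {peak_gb:.1f}GB")
```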
It was my bad, I pushed the initial version into
I would like to add the generation server scripts from #325 to this branch.
I am still seeing an illegal memory access error for batch size = 2. @RezaYazdaniAminabadi @stas00
@jeffra
@stas00, I see the memory configuration is [0, 51, 51, 51, 51, 51, 51, 51].
The only reason to keep the first GPU unallocated with model weights is to allow for a much higher batch size (BS). If you don't need the higher BS, you don't need to do that.
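For illustration, in the Accelerate-based script such a map roughly corresponds to a `max_memory` argument like the following (a sketch; the exact values and call in the script may differ):

```python
import torch
from transformers import AutoModelForCausalLM

# keep GPU 0 free of weights so its memory can hold a much larger batch;
# 51GiB per remaining 80GB GPU is illustrative for an 8-GPU node
max_memory = {0: "0GiB", **{i: "51GiB" for i in range(1, 8)}}

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom",
    device_map="auto",          # let accelerate place layers across GPUs 1-7
    max_memory=max_memory,
    torch_dtype=torch.bfloat16,
)
```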
Wrt merging - let me merge this PR for now, then rebase your PR and refactor, and then we will see how to proceed. Later we will move the whole thing into
I gave you write access to the repo, so it'll be easier for you to contribute. Just please don't push directly w/o a PR, and wait for at least one approval from another member before merging.
* Reshape deepspeed checkpoint (#239)
  * Reshape deepspeed checkpoint
  * add checkpoint tests
  * Validate input folder
  * Tests for tp/pp reshape
  * remove debug folders
  * fix test_checkpoint_reshaping_empty_dir
  * Fix unit tests
  * Remove deepspeed checkpoint utils
  * Use DS 3D reshaping utils
  * convert to bf16
  * wip universal chkpt
  * rename
  * rename
  * wip on fragments dealing
  * cleanup
  * Loading universal checkpoint with reshaping
  * all gpu1<->2 reshapes work
  * param attrs
  * make the tests adaptable to the number of available gpus
  * WIP
  * WIP
  * WIP
  * WIP
  * Debug functions
  * args should be required, don't create another latest file
  * Parallelize shard extraction
  * close+join pool; add tqdm; comment out noise
  * rename
  * parameterize
  * Parallel slice merging
  * Cleanup
  * allow inspection on a machine w/o gpus
  * test against the right DS branch
  * DS size was merged

  Co-authored-by: Stas Bekman <[email protected]>

* BLOOM Inference via DeepSpeed-Inference, Accelerate and DeepSpeed-ZeRO (#308)
  * hardcode the dtype depending on the model
  * change the mp based on the world_size
  * remove hardcoded world_size
  * add bigscience/bigscience-small-testing
  * fixes
  * add zero-inference script
  * fixes
  * fix
  * working script
  * renames
  * fixes
  * fix for offline use
  * add benchmark
  * add benchmark
  * update
  * cleanup
  * update
  * msecs
  * cleanup
  * improve
  * fix benchmark, add warmup
  * update
  * fix; thanks Michael Wyatt
  * clarify
  * add bloom batch-inference script
  * removed the names :-)
  * fold the bs functionality from the other script
  * fix
  * restore do_sample
  * dump generate args
  * fix
  * fix
  * support any batchsize
  * div by bs
  * mul by bs
  * add cpu_offload; sync scripts
  * wip
  * improvements
  * fixes
  * fixes
  * add accelerate script
  * fix
  * wip
  * wip
  * stats
  * add OnDevice and remove zero-inference (#316)
    * wip
    * rework generate + benchmark
    * figure out the memory map dynamically
    * bug fix
    * fix ds-zero-inference wrt device
    * bug fix
    * update
    * update
    * fix

Co-authored-by: Reza Yazdani <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Co-authored-by: Reza Yazdani <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>
Update: I expanded the PR to include Accelerate and DeepSpeed ZeRO - please see the README for full details.
This PR is sorting out the inference script for BLOOM via DeepSpeed-Inference microsoft/DeepSpeed#2083
I pushed the main script into `main` already, so this is just the fixes to that script.
setup transformers
make sure you are on the latest `transformers@main`
setup DeepSpeed
Get the DS master branch
setup Meg-DS
run the script:
adapt to the number of wanted GPUs; use the larger models if needed.
p.s. I also added a zero3-inference script, but you must edit the nvme path and this one is super-slow - but it works and requires only 1x 24GB GPU and not 8x 80GB :)
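For reference, the ZeRO side of that setup is roughly this kind of DeepSpeed config (a sketch using the standard config keys; the script's actual config and the nvme path will differ):

```python
# ZeRO-3 with parameters offloaded to NVMe: weights are streamed in per layer,
# which is why a single ~24GB GPU is enough - and also why it is slow.
ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/path/to/fast/nvme",  # placeholder - edit for your setup
            "pin_memory": True,
        },
    },
    "train_micro_batch_size_per_gpu": 1,
}
```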
On JZ you must do:

(I already pre-cached `bigscience/bloom-350m` and `bigscience/bloom`, so it should work from offline mode.)

@RezaYazdaniAminabadi, @jeffra