zero_optimization.cpu_offload: true leads to a silent crash #610
@stas00, thanks for reporting this issue. Can you please set …
Thank you, @tjruwase. Would you please add an assert to help the user know that? Surely a silent exit w/o a traceback must be incomplete, right? When I did that, it moves along and then crashes:
Same config as in the OP, but changed to …
I think I see the issue, based on your stack trace.
Can you please call …
Ah, right, thank you, @tjruwase - I switched to … Any ideas why, and what do I need to change to make it work? Talking about batch size - when does …
@stas00 Thanks for the update. It appears that you are running into usability gaps, so apologies for the inconvenience, but hopefully this will enable overall improvements for other users. For zero-offload memory usage, can you please check out #467 to see if it is relevant? In the meantime, I will look up an explanation of DeepSpeed's handling of …
Please see #131 for a discussion on batch sizes.
This was a perfect reference, thank you!
The gist of that thread was: …
So the setting I was trying was … and I was getting: …
I guess it divided it by 2 GPUs and rounded down to 0, right? It'd probably be a good defensive tactic, and would prevent this kind of question, for the code to test that the derived …
(I haven't looked at the actual variable names; this is just a proof of concept.) But if it was preconfigured, it should check that it's at least 1.
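A minimal sketch of the kind of check being proposed (the function and variable names here are illustrative, not DeepSpeed's actual internals):

```python
# Illustrative only: names are hypothetical, not DeepSpeed's real internals.
def derive_micro_batch_size(train_batch_size, world_size, gradient_accumulation_steps):
    micro_batch_per_gpu = train_batch_size // (world_size * gradient_accumulation_steps)
    assert micro_batch_per_gpu >= 1, (
        f"train_batch_size={train_batch_size} is too small for world_size={world_size} "
        f"and gradient_accumulation_steps={gradient_accumulation_steps}: "
        "the derived per-GPU micro batch size rounds down to 0"
    )
    return micro_batch_per_gpu
```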
And another assert that is needed, which we discussed elsewhere, is: …
Going back to …
I'm just not sure what to do with the `"train_batch_size"` entry in the config file then. I am not allowed to leave it out. What would be your recommendation?
I will put the required asserts on my TODOs, but will appreciate a PR as well if you have the bandwidth. There is currently no benefit to having deepspeed manage datasets, and our largest examples, bing_bert and megatron-lm, manage their own datasets. So I think your current approach with the HF trainer is fine. In future, we might explore innovations in data loading, so watch this space. Regarding …: in reality, …
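For concreteness, a consistent setting looks roughly like this (an illustrative fragment, assuming the usual DeepSpeed convention that `train_batch_size` equals `train_micro_batch_size_per_gpu` × `gradient_accumulation_steps` × the number of GPUs):

```python
# Illustrative fragment for a 2-GPU run, assuming
# train_batch_size == train_micro_batch_size_per_gpu * gradient_accumulation_steps * num_gpus
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 2,
    "train_batch_size": 16,  # 4 * 2 * 2 GPUs
}
```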
The problem is that I'm unfamiliar with those parts of the code; it'd be very quick for someone who knows the code to add, and it's not urgent now that you told me what the right mix is. Thank you for adding those at some point in the future, @tjruwase
Excellent. Thank you for confirming that! A related question: what about the …? (Please let me know if it'd be better to open a separate issue with this question, to make it easy to peruse your answers in the future.)
Awesome! Thank you for the explicit details! That's very helpful! The problem with your suggestion is this: we tweak all the cl args at the command line (and most ML programs have 100s of those), so remembering to change a config file to match the same cl args is very error-prone and will lead to a lot of mistakes/wasted time. Tweaking BS is most prevalent when dealing with OOM, so having a mismatch there would be especially error-prone. Could ds reuse some of the args to derive that info from, e.g., …? Thinking more about it, I think BS is a special case, since half of it is handled by the host program and the other half by DS - so there should be one place where it's set. So even if no other cl args from the host program are supported, I strongly believe that this one should be supported.
A few thoughts on cl args vs. config file.
Thank you for sharing your detailed notes on the design decisions with regard to the configuration process. I do agree that having hundreds of cl args can be difficult to manage, and your config file is way easier to parse visually. Here is my feedback on using the config-file approach so far:
This is diverging from this issue though - please let me know if you'd like to continue this discussion, in which case I will make a separate issue.
Oh, it's undocumented. Awesome - I'm going to try it - at the moment just passing the batch size would be fantastic. I will keep you posted once I sort it out. P.S. I just can't say enough how much we appreciate your amazing support - it's absolutely outstanding and very helpful in quickly moving along with integrating your magic engine!
So if I try to pass …
(Grr, I see these are two traces intermixed. Any practical suggestions on how to deal with such situations in general? I guess there should be a way to signal the logger to log only for local_rank=0?) I did: …
Does it expect the whole config instead of the config file? I thought I could pass the config file for everything, but … As we discussed earlier, BS is special when it's needed by both DS and the host application (when DS isn't handling datasets).
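For context, a simplified sketch of handing `deepspeed.initialize` a dict rather than a file path; whether the keyword is `config_params` or `config`, and whether an `args` namespace is also required, depend on the DeepSpeed version and are assumptions here:

```python
# Simplified sketch, not the actual HF integration. The `config_params`
# keyword (vs. `config` in other versions) and its dict support are assumptions.
import torch
import deepspeed

model = torch.nn.Linear(8, 8)  # placeholder model

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 1,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
}

# Note: some DeepSpeed versions also require an `args` namespace argument here.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config_params=ds_config,
)
```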
And another question: we have … which can be quite different. How should we handle this with deepspeed? As I see it, there is only one batch_size var and its … Evaluation is in general less memory-demanding since it doesn't need gradients, yet it can be quite memory-hungry due to beam search and may require more memory than training.
Yes, DeepSpeed currently does not support eval/inference execution. The reason is that training was previously the memory- and computation-intensive part, and so our current examples do evaluation outside of DeepSpeed. But recently we have seen requests to add support for evaluation/inference, so it is now on our short-term timeline.
Oh, I was completely unaware of this. So I need to do more work then, to undo deepspeed for the post-training stage. I suppose I need to remove the DeepSpeed layer from the DDP(DeepSpeed(OriginalModel)) stack, so that it becomes DDP(OriginalModel) as soon as training is done. So something like: …
I also need to make sure that there are no memory-holding left-overs from deepspeed, so that all of the GPU memory is available again. You don't think parts of ZeRO could be of help during eval? I guess there aren't many moving parts, as it's then quite localized to each GPU. I'm thinking perhaps ZeRO memory management could save some memory there. But I could be wrong.
@stas00 I typically just retain a reference to my original …
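A rough sketch of what that approach might look like (illustrative only; it assumes the retained object is the original `nn.Module` that was handed to DeepSpeed, and that the last reference to the engine is dropped first):

```python
# Rough sketch: after training, drop the DeepSpeed engine, reclaim GPU memory,
# and evaluate the original model that was kept around.
import gc
import torch

def switch_to_plain_eval(original_model: torch.nn.Module) -> torch.nn.Module:
    """Call this after the last reference to the DeepSpeed engine has been
    dropped (e.g. `engine = None`); it reclaims cached GPU memory and puts
    the retained original model into eval mode."""
    gc.collect()              # collect the now-unreferenced engine objects
    torch.cuda.empty_cache()  # return cached GPU memory to the CUDA driver
    return original_model.eval()
```

Usage would be roughly `engine = None; model = switch_to_plain_eval(model)` once training is finished.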
Thank you for your comments, @g-karthik. Yes, we are recoding things now so that we have: …
I suspect we must clear out any references to deepspeed to free the GPU memory if we aren't using it at later stages, but I could be wrong and it may already happen automatically. I will find out once I get a chance to implement it, as I didn't realize I needed to remove DS for the evaluation stage - it worked just fine with it. Perhaps I don't need to do anything at all.
@stas00 I think you'll probably need to free some references for evaluation when the model size is very large, such as perhaps T5-11B. We can discuss this on your …
@g-karthik Thanks for sharing your experience with using DeepSpeed for evaluation. @stas00, I will gladly defer to @g-karthik on this topic, as he is much more knowledgeable than me :).
We very much appreciate you also offering to support our DS integration process, @g-karthik!
@tjruwase, I got a chance to experiment with your suggestions and it is mostly working, with some TLC needed for …
At the moment here is what I came up with (a reminder: I'm trying to solve the problem of needing to get some config values through the HF trainer cl args, but the bulk of it via …):
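Roughly, the shape of it is this (an illustrative sketch rather than the actual code; the attribute names on `args` are hypothetical stand-ins for the HF trainer's cl args):

```python
# Illustrative sketch: take the bulk of the settings from ds_config.json and
# override the few values that the host program owns at the command line.
import json

def build_ds_config(config_path, args):
    with open(config_path) as f:
        ds_config = json.load(f)
    # command-line values win over whatever the file says
    ds_config["train_micro_batch_size_per_gpu"] = args.per_device_train_batch_size
    ds_config["gradient_accumulation_steps"] = args.gradient_accumulation_steps
    return ds_config
```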
@stas00 Happy New Year. Apologies for the radio silence; I finally succumbed to the holidays. I want to resume the integration effort, starting with this issue of deepspeed.initialize requiring …
Happy New Year to you too, @tjruwase! Yes, it is hard to do co-operative work when the people one cooperates with aren't around ;)
I just submitted a fix and unit test. Thanks for catching this bug.
I checked that it is no longer required - thank you for the quick fix.
I'm experimenting with various `zero_optimization` config options and I noticed that when I flip `zero_optimization.cpu_offload` to `true`, the application exits w/o crashing or doing any training - a silent exit, doing nothing:

Full log

If I flip `zero_optimization.cpu_offload` to `false`, everything works:

Full log

I know I haven't provided reproduction info, as I haven't quite finished working on the integration with HF `transformers`, but it should be ready soon. I was hoping you could tell from the logs what went wrong. But if it isn't helpful, I will update this issue with reproduction details once I have a `transformers` branch you could experiment with.
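(For reference, the toggle in question sits in the config roughly like this - an illustrative fragment rather than the full config from the run, and the stage value is an assumption:)

```python
# Illustrative fragment only; the surrounding settings and the stage value
# are assumptions, not the full setup from the run.
ds_config = {
    "zero_optimization": {
        "stage": 2,
        "cpu_offload": True,  # flipping this from False to True triggers the silent exit
    },
}
```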