Add generation server scripts using HF accelerate and DS-inference #328
Conversation
This is still WIP
thank you for working on refactoring and the server code, @mayank31398
- let's create libraries rather than having direct scripts importing from the server scripts, so both types import the same library. So probably the following structure:
  - bloom-accelerate-inference.py # shared library code
  - bloom-accelerate-inference-cli.py # the original script, importing from bloom-accelerate-inference.py
  - bloom-accelerate-inference-server.py # the server script, importing from bloom-accelerate-inference.py

same for the other sets, e.g. bloom-deepspeed-inference.py, bloom-deepspeed-zero-inference.py, ...
This is of course just a suggestion, I'm happy to hear other ideas.
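A minimal sketch of what that layout could look like. The helper name is hypothetical, and underscored filenames are assumed since Python cannot import hyphenated module names directly (importlib would be needed otherwise):

```python
# bloom_accelerate_inference.py -- the shared library code
# (underscores assumed; "bloom-accelerate-inference.py" is not importable as-is)
from transformers import AutoModelForCausalLM, AutoTokenizer

def load_model_and_tokenizer(name: str = "bigscience/bloom"):
    # hypothetical shared helper used by both the cli and the server scripts
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)
    return model, tokenizer

# bloom_accelerate_inference_cli.py -- the original script, importing the library:
#   from bloom_accelerate_inference import load_model_and_tokenizer
#   model, tokenizer = load_model_and_tokenizer()
```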
(force-pushed from 5bc3d84 to d1676fc)
@stas00, I believe the code is in good shape. You can start reviewing.
(force-pushed from 792cda0 to 8f25200)
Oh wow, you developed a whole package around this. Impressive work, @mayank31398

Let's just discuss the user-facing APIs before implementing further, to minimize wasted time. Originally these were just scripts to benchmark and demo how the various solutions can be used, but now I'm not so sure how demo-friendly they are with so many separate files. Some design questions:
I'm also no longer sure what to do with this package. Originally the plan was to stash the original demo scripts under transformers' research files, but now it's growing towards an independent package. Not a bad thing, I'm just thinking aloud.
Let me know your thoughts on this. Thanks for the review :)
Also, I am not sure if this works for other models (a few strings are hardcoded). I think for now we should stick to BLOOM.
So you are not benchmarking the same thing then; you don't need to broadcast anything for non-server usage. It sounds like you built something more elaborate. The intention of the cli was the way it was in the original code: tokenize + generate + detokenize, done. Perhaps we have miscommunicated here.
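For reference, a minimal sketch of that tokenize + generate + detokenize flow, using the small bigscience/bloom-560m checkpoint purely for illustration:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

inputs = tokenizer("DeepSpeed is", return_tensors="pt")        # tokenize
outputs = model.generate(**inputs, max_new_tokens=20)          # generate
print(tokenizer.decode(outputs[0], skip_special_tokens=True))  # detokenize
```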
I answered earlier in the code comment.
I meant the script should work out of the box and require no user input. I think we have a conflict of use cases here: I wanted an easy-to-understand-and-replicate demo, you want a standalone system. Perhaps the right solution is to keep the original scripts as they are for the demo/benchmarking purpose, and have your package for production use. What do you think? There will be a bit of duplication, but as you're saying, you don't care for solutions that are not fast enough, which is fair for your needs.

What crashes?
For end-user scripts there should be no need to import them.
Agreed, let's stick to BLOOM for now. That's why I suggested that the naming of the scripts should indicate it's BLOOM, which your version doesn't.
The cli works this way; the user only needs to provide input_text. Also, I need to call the generate method on all 8 GPU processes (in non-server usage), and since the input text is only read on 1 process, it needs to be broadcast to the others, as far as I understand. I will incorporate your suggestions.
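A sketch of that broadcast, assuming torch.distributed is already initialized (the helper name is hypothetical):

```python
import torch.distributed as dist

def broadcast_input_text(input_text=None):
    # rank 0 holds the user's prompt; every other rank passes None
    obj = [input_text if dist.get_rank() == 0 else None]
    dist.broadcast_object_list(obj, src=0)  # all ranks now hold the same string
    return obj[0]
```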
So, just to summarize for cli.py:
Good question - I have never tried it - I guess it should work.
Also, @stas00, I have been meaning to ask.
It can but it's not using it in our scripts here. It uses a naive pipeline parallelism, assigning different layers to different devices and switching between those. |
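That naive pipeline parallelism is roughly what accelerate's device_map gives you; a sketch, not the exact scripts in this PR:

```python
import torch
from transformers import AutoModelForCausalLM

# device_map="auto" assigns whole layers to the available GPUs (and CPU/disk
# if needed); only the device holding the current layer is busy at any moment
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
```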
I feel like this is the best approach for now. If we want to change this we can do that in a separate PR in the future.
If you use … alternatively you can do: …
If neither is set, … You can see: …
Actually, the first line isn't needed, as it's already the default; so just: …
DeepSpeed-MII not working with new Microsoft weights: microsoft/DeepSpeed-MII#50
@stas00 https://huggingface.co/bigscience/bloom/tree/main
Any idea how to get around this? I am not sure of a good solution.
It looks like only README.md was updated, so it's weird that it'd fail. I just ran:
… and it updated w/o any issues. Probably move the file it complains about for now? Could be some bug; it shouldn't redownload anything but the new or updated files.
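For context, the download in question is huggingface_hub's snapshot_download, which caches files locally and is only supposed to fetch new or changed ones on a re-run:

```python
from huggingface_hub import snapshot_download

# re-running this should reuse the local cache and only fetch updated files
weights_path = snapshot_download("bigscience/bloom")
```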
@mayank31398, please also note the renames in this commit: https://github.com/microsoft/DeepSpeed/pull/2217/files/c77f5e0e385db405475b2842a6f5725c0f551170..f3f4b1dda9f9ba09042b8b63bbee475c9364bc56. In particular, this is now merged, so we should update our side. You can just do a global search-and-replace on all of the files under scripts - or, if you'd prefer that I do it, please let me know; I don't want to step on your toes if you're still on the old DS branch.
Let me try updating to the latest huggingface_hub and deepspeed to see if that fixes things.
Nevermind, I have fixed this. This was a bug caused by all 8 processes in my DeepSpeed code trying to download simultaneously.
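One common way to avoid that race is to let a single rank download first while the others wait on a barrier and then read from the cache; a sketch, assuming the process group is initialized (the helper name is hypothetical):

```python
import torch.distributed as dist
from huggingface_hub import snapshot_download

def download_once(repo_id):
    # rank 0 downloads (or verifies) the snapshot while the others wait
    if dist.get_rank() == 0:
        path = snapshot_download(repo_id)
    dist.barrier()
    if dist.get_rank() != 0:
        path = snapshot_download(repo_id)  # cache hit, no concurrent writes
    return path
```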
Oddly enough, … It doesn't look like …
@mayank31398, would it be OK if we merged this PR - and then perhaps have another iteration of tweaks in subsequent PRs if needed? As we are renaming files I want the rename to be merged first before tweaking those. You have the commit rights, so you can do the honours if you're in agreement. Thank you!
Let's merge this
Yes @stas00, currently MII is not working with DS-inference. But we can merge and I can continue to work on this.
Also, @stas00, there are some pieces of code that both the server folder and the scripts folder can share.
Perfect, then you can start a new PR once you have some new edits. We can just do it in iterations. I need to edit the original scripts to support the new repos; will do in another PR.
Add generation server scripts using HF accelerate and DS-inference (bigscience-workshop#328)

* first step towards making libs
* HF accelerate model
* refactor accelerate
* refactor DS inference
* refactor DS ZeRO
* make inference library
* cli
* server
* request
* remove MaxTokensError
* fix batch size error with DS inference server
* type fix
* add latency
* add latency
* add min_length to default kwargs
* str kwargs
* str kwargs
* fix comma
* add old scripts back
* move scripts
* drop data
* minor changes + add README
* update README
* drop nccl
* fix
* default values
* resolve issues
* handle keyboard interrupt
* remove caching
* use snapshot_download
* make server class
* fix snapshot download

Co-authored-by: Mayank Mishra <[email protected]>
Currently, the scripts are working correctly.
However, there is some memory leak.
In huggingface/accelerate#614, @sgugger says that it's not in accelerate, which is probably true.
My hunch is that this is a related bug: apache/mxnet#19159