Request to fix the content about parallelformers in README. #81

Closed · hyunwoongko opened this issue Jun 6, 2023 · 1 comment

hyunwoongko commented Jun 6, 2023

Hello! Thanks for the great work.
I am the author of parallelformers, and I saw that you mentioned parallelformers in the README.

> parallelformers implements a fixed [list of architectures](https://github.com/tunib-ai/parallelformers/tree/main/parallelformers/transformers)

First, I would like you to fix the link to https://github.com/tunib-ai/parallelformers/tree/main/parallelformers/policies. In fact, parallelformers supports more than 60 architectures with pre-defined policies, and users can also parallelize unsupported models with their own policy class if they want, like this. The link you mentioned shows only part of the supported models, and it is not good to give users incorrect information.
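To give a concrete sense of what such a policy class encodes, here is a rough, hypothetical sketch of a model-specific sharding description (the class and attribute names are illustrative only, not the actual parallelformers `Policy` API; see the policies directory linked above for the real one):

```python
# Hypothetical sketch of a model-specific tensor-parallel policy.
# All names here are illustrative, not the real parallelformers API.

class MyModelPolicy:
    """Describes how one architecture's weights should be sharded."""

    # Linear layers split along the output dimension (column-parallel).
    column_parallel = [
        "attention.q_proj",
        "attention.k_proj",
        "attention.v_proj",
        "mlp.fc_in",
    ]

    # Linear layers split along the input dimension (row-parallel);
    # their partial outputs are all-reduced across tensor-parallel ranks.
    row_parallel = [
        "attention.out_proj",
        "mlp.fc_out",
    ]

    @staticmethod
    def scale_config(config, world_size):
        # Attention heads handled by each rank after splitting.
        config.num_attention_heads //= world_size
        return config
```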


Second, I found this code and #45. Your README says something like 'this library supports any architecture automatically'. Then why are you creating these configs if the tensor_parallel library can split any model 'automatically'? (And currently there are only 7 pre-defined configs, am I right?)

I am really interested in how you implemented tensor model parallelism for any model architecture automatically. I discussed this with the MS DeepSpeed and HF transformers teams for a year, and we couldn't find an appropriate method for automation. We tried to automate this process for some models, but we couldn't parallelize ALL models stably, because there are so many model architectures in HF transformers and some models are structured differently from others. (You can see the discussion here.) I also wonder how many models you tested with your library.

That's why parallelformers and DeepSpeed were implemented with model-specific policy classes. You can see the injection_policy argument here. If you have found a better and more stable method than ours, I would like to hear how it works and learn from you.
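For reference, DeepSpeed exposes this per-model mapping through the `injection_policy` argument of `deepspeed.init_inference`. A hedged usage sketch (GPT-J is just an example architecture; exact argument names can differ between DeepSpeed versions):

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM
from transformers.models.gptj.modeling_gptj import GPTJBlock

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

# Map the transformer block class to the linear layers whose outputs
# must be all-reduced after tensor-parallel splitting.
engine = deepspeed.init_inference(
    model,
    mp_size=2,  # tensor-parallel world size
    dtype=torch.float16,
    injection_policy={GPTJBlock: ("attn.out_proj", "mlp.fc_out")},
)
```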


Third, we are creating a new library, https://github.com/EleutherAI/oslo, for distributed model training; it supports TP, PP, DP, ZeRO, MoE, and mixtures of them. (You can see an example here.) parallelformers was developed with a producer-consumer architecture for web server deployment, so it is not appropriate for model training. After developing parallelformers, I decided to create a new library for model training. If you are interested in connecting with us, please feel free to let me know. I would be interested in adding your great TP algorithm to our library.

Thanks.

hyunwoongko changed the title from "Parallelformers in README." to "Request to fix the content about parallelformers in README." on Jun 6, 2023
BlackSamorez (Owner) commented

Thank you for the comment! I've also been following oslo closely, and it looks really promising!
I'll update the link and reformulate the sentence so it's accurate.

> Then why are you creating these configs if the tensor_parallel library can split any model 'automatically'?

tensor_parallel can automatically partition the linear and convolutional layers inside a model, which gives decent (but not perfect) tensor-parallel efficiency. For some models that I use frequently, I've implemented custom configs that offer somewhat better performance in some cases.
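As a minimal sketch of that generic approach (plain PyTorch, not tensor_parallel's actual internals): split a single linear layer column-wise and concatenate the partial outputs, which reproduces the full layer.

```python
import torch
import torch.nn as nn

# Column-parallel split of one nn.Linear across two shards.
full = nn.Linear(1024, 4096, bias=False)
w0, w1 = full.weight.chunk(2, dim=0)  # split along output features

shard0 = nn.Linear(1024, 2048, bias=False)
shard1 = nn.Linear(1024, 2048, bias=False)
with torch.no_grad():
    shard0.weight.copy_(w0)
    shard1.weight.copy_(w1)

x = torch.randn(1, 1024)
y = torch.cat([shard0(x), shard1(x)], dim=-1)  # gather partial outputs
assert torch.allclose(y, full(x), atol=1e-5)
```

In a real multi-GPU setup each shard lives on its own device, and row-parallel layers sum (all-reduce) their partial outputs instead of concatenating them.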

> ... because there are so many model architectures in HF transformers and some models are structured differently from others. ... That's why parallelformers and DeepSpeed were implemented with model-specific policy classes.

I agree: if you want TP to reach DeepSpeed levels of efficiency, it's really difficult to support arbitrary models. I've designed tensor_parallel to be generic at the cost of leaving some efficiency on the table. The main use case of tensor_parallel is to let a researcher quickly play with a new architecture that doesn't fit on one GPU, without having to worry about how to parallelize it. I've designed it for myself, really, because for me it is often more important to run an experiment today than to run it at peak efficiency. In other words, it seems that we address slightly different use cases.

As for adding the TP algorithm to your library, I'd be happy to help, but I'm kind of busy right now. Maybe someday later.
Best of luck with your library!
