We made a toolkit that can parallelize almost all Hugging Face models, but we have some questions! #12772
Comments
Hello @hyunwoongko, thanks a lot for sharing, this is a really cool project! No problem at all regarding the image homage (really cool logo by the way!) I'm pinging @stas00 who has led the efforts of model parallelization and DeepSpeed integration on our side and would probably be interested. Also pinging @sgugger as he has done some similar work. |
Thank you for implementing and sharing your project, @hyunwoongko, I haven't had a chance to study your project closely yet, but is it correct that you implemented tensor parallelism from Megatron? (There are many types of model parallelism and it is much easier to understand things when the generic MP term is not used, but an explicit type is described. Here is my initial attempt to map out the distinctions https://huggingface.co/transformers/master/parallelism.html) |
@stas00 |
Oh, but why have you deleted all the detailed comments you posted earlier? I was looking forward to studying those and now they are all gone. I'm puzzled. My plan was to do a feasibility study and then see if we can integrate your work into HF transformers. Just very busy with other projects to respond quickly at the moment. |
Because the comments were written too hastily and ran too long, I decided it would help you more if I organized them neatly and accurately instead of explaining everything in the issue comments. (I am going to publish a blog post soon, maybe within this week.) |
In the end I was able to cheat since GitHub sent me all the comments ;) So I have just read the comments you deleted. They weren't long at all; on the contrary, I'd say they could use more detail in places. Some images were great and some weren't super clear, so adding some words would help. And I really appreciate that you want to merge this into HF transformers! That would be an amazing contribution!
So, bottom line: besides the leaner launcher, the core of your project is Tensor Parallelism built upon Megatron-LM, correct? This is exactly what I was planning to work on when I had free time, so your timing is perfect.
Let's discuss the training side of it. I think that for most users of HF transformers that would be the most important application of Tensor Parallelism. In the deleted note you mentioned that DDP support needs to be integrated to make it work in training. That's the MPU part, right? And we should probably think about Pipeline too while building the MPU, without implementing it just yet.
Also, do you think it'd be a good idea to invite @RezaYazdaniAminabadi into this process, so that we can gradually use your project's flexibility and add the speed of Deepspeed CUDA kernels where possible, i.e. work together with the Deepspeed project? That's of course if Reza is interested and his superiors support the effort. We already discussed with Deepspeed starting to deploy some of their kernels in transformers (but haven't done anything yet).
How do you propose we work on integrating this? Perhaps pick a few models first and work on a PR that integrates those, and then work on other models in a subsequent PR? Probably leaving the optional launcher out at first and then considering it next?
On a personal note: we are about to launch the first training of the Big Science project (https://github.com/bigscience-workshop/), so my availability depends on that. If all goes well when we launch, I will have more time; if not, please bear with me, and I will do my best to support this integration process at least a bit at a time. If you have any questions or concerns, please don't hesitate to ask. I will try to address those. |
I have sent my thoughts about collaboration to your email! |
Thank you for emailing me your notes, @hyunwoongko. We need to discuss it here and not in private, since this is not my personal project. Therefore please re-paste all of it, or just the parts that you feel are open to the public, and we will continue the discussion here. |
Okay. First of all, I'm very happy to have your positive comments. Here are my thoughts.
|
Everything you shared sounds good to me, @hyunwoongko. With regard to 3D parallelism: currently the main obstacle in HF Transformers to supporting Pipeline Parallelism (PP) is the presence of multiple optional features that prevent the model from being convertible to I posted this earlier, could you please address this?
Practically, since you understand your code the best, please let's discuss how to approach the integration of it. Also let me add a reference to your project at https://huggingface.co/transformers/master/parallelism.html#tensor-parallelism |
I totally agree with your opinion. An interesting thing is that my former colleague was the first to implement PP in PyTorch (torchgpipe). He first implemented it in a way that uses One thing I'm considering is to utilize
Yes, we need to implement the training side of it. However, it seems a little difficult to use NVIDIA's mpu implementation in transformers. My idea is to leverage the mechanism of parallelformers again, which reuses as much of the existing transformers code as possible. When I was implementing parallelformers, I was able to successfully parallelize most models Combining DP and DDP probably requires minor changes to the existing torch implementation. As you know, with DP and DDP the same model parameters are broadcast to all GPUs, and a different piece of data is sent to each GPU. e.g.
This needs to be partitioned. If Tensor MP size is 2, we should create two partitions. e.g.
And I think that the data should be split by each partition, not by each GPU. e.g.
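As a rough illustration of this last point (not code from parallelformers or torch; the helper names `partition_of` and `shard_batch_by_partition` are hypothetical, and 4 GPUs with a tensor MP size of 2 are assumed), the batch can be split once per partition so that every GPU inside a partition receives the same shard:

```python
# Hypothetical sketch: helper names are illustrative, not a real library API.

def partition_of(rank, tensor_mp_size):
    """Return (partition index, position inside the partition) for a GPU rank."""
    return rank // tensor_mp_size, rank % tensor_mp_size


def shard_batch_by_partition(batch, world_size, tensor_mp_size):
    """Split the batch once per partition; GPUs in one partition share a shard."""
    if world_size % tensor_mp_size != 0:
        raise ValueError("world_size must be divisible by tensor_mp_size")
    num_partitions = world_size // tensor_mp_size
    if len(batch) % num_partitions != 0:
        raise ValueError("batch size must be divisible by the number of partitions")
    shard_size = len(batch) // num_partitions
    shards = {}
    for rank in range(world_size):
        part, _ = partition_of(rank, tensor_mp_size)
        shards[rank] = batch[part * shard_size:(part + 1) * shard_size]
    return shards


# Assumed setup: 4 GPUs, tensor MP size 2, a toy batch of 8 samples.
shards = shard_batch_by_partition(list(range(8)), world_size=4, tensor_mp_size=2)
for rank, shard in shards.items():
    print(f"gpu{rank}: partition {partition_of(rank, 2)[0]}, data {shard}")
```

With plain DP or DDP the batch would instead be cut into four pieces, one per GPU; here it is cut into two, one per partition.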
|
I wrote it with a little help from a translator. If anything is unclear, please tell me :) Here is a first draft of the collaboration plan. Please feel free to comment. Everyone involved in the collaboration will be able to modify this plan depending on the circumstances.
Step 1. Collaborate DeepSpeed and TUNiB to move Parallelformers Tensor MP
Existing layers are replaced using the scalable method from parallelformers. This method does not change the entire transformer layer; it only replaces a few linear layers with a sliced linear layer or a sliced all-reduce linear layer. Since DeepSpeed's Tensor MP replaces the entire Transformer layer, it cannot reflect the specific mechanism of each model. First, I will implement this method and open a PR to DeepSpeed. (And this is what the DeepSpeed team wants me to do. refer to here) Ultimately, it's a good idea to archive parallelformers after most of its mechanisms have been moved into DeepSpeed. It's a pity that our toolkit will be archived, but I think user accessibility is much more important, because I want more people to be able to use large models easily. Parallelformers is less accessible than HF Transformers and MS DeepSpeed.
Step 2. Collaborate DeepSpeed and TUNiB about fused CUDA kernel
However, it is quite challenging to combine this with the CUDA kernel in the training process. In my opinion, it would not be difficult to implement the forward pass; the problem is the backward pass. There is currently no backward pass implementation in the Tensor MP kernel in DeepSpeed. Because Tensor MP is currently provided for inference only, the DeepSpeed team didn't need to implement a backward pass. Unfortunately, since I do not have a deep understanding of CUDA code, it will be difficult for me to write the CUDA backward code myself. Therefore, this part should be done in collaboration with DeepSpeed.
It would be nice if we could collaborate with DeepSpeed and discuss the backward implementation of the DeepSpeed Tensor MP kernel. If this is impossible, it may be difficult to use the CUDA kernel during the training process.
Step 3. Collaborate Huggingface and TUNiB about transformers
In this step, we will add the Tensor MP kernel newly implemented by DeepSpeed and TUNiB to HuggingFace. I think it will be similar to the Policy I implemented in parallelformers. There are two ways to add it on the HuggingFace side.
Step 4. Collaborate Huggingface and TUNiB about DP, DDP, PP
Once Tensor MP is done, we will be able to proceed with combining it with DP and DDP. At the same time, it would be good to consider implementing PP using |
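To make the "sliced linear layer" and "sliced all-reduce linear layer" of Step 1 concrete, here is a minimal pure-Python sketch. It assumes the Megatron-LM style layout discussed earlier in the thread: the first weight is split by columns, the second by rows, and the partial outputs are summed, which stands in for an all-reduce. All function names are hypothetical and the matrices are toy integers; this is not the parallelformers or DeepSpeed implementation.

```python
# Illustrative sketch only: simulates two "devices" inside one process.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    """Element-wise ReLU; safe to apply to each column slice independently."""
    return [[max(0, v) for v in row] for row in m]


def split_columns(m, parts):
    """Sliced linear layer: each device keeps a block of output columns."""
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]


def split_rows(m, parts):
    """Sliced all-reduce linear layer: each device keeps a block of input rows."""
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]


def all_reduce_sum(partials):
    """Stand-in for an all-reduce: element-wise sum of per-device outputs."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


x = [[1, -2], [3, 4]]                      # batch of 2 samples, hidden size 2
w1 = [[1, 0, -1, 2], [2, 1, 0, -1]]        # 2 x 4, split by columns
w2 = [[1, 2], [0, 1], [-1, 0], [2, 1]]     # 4 x 2, split by rows

reference = matmul(relu(matmul(x, w1)), w2)

num_devices = 2
w1_slices = split_columns(w1, num_devices)
w2_slices = split_rows(w2, num_devices)
partials = [
    matmul(relu(matmul(x, w1_slices[d])), w2_slices[d]) for d in range(num_devices)
]
parallel_out = all_reduce_sum(partials)

print(parallel_out == reference)
```

Only the two linear layers are touched; everything between them runs unchanged on each slice, which is why this approach can follow model-specific mechanisms without replacing the whole transformer layer.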
We probably should discuss PP elsewhere and focus in this thread on what's already working in your project. So I will give a brief overview only:
Great!
The 3 frameworks that currently provide PP as an API that I know of are fairscale, deepspeed and pytorch's recent versions - these all require Actually, the main complication of the current models is the inputs/outputs. PP requires simple tensor variables that can be sliced at the batch dimension. HF models have a gazillion variables that aren't tensors and thus can't be sliced. Some variables are tuples of tuples and are used as aggregates. If you'd like to see the sort of hoops I had to jump through to make it work for t5, please see:
Note that over the spring pytorch has developed a much more user-friendly PP API, which now allows passing non-tensor variables, which should make things much easier. Most likely we will have to make stripped versions of the current models which support only the features that PP can accommodate. |
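The slicing requirement can be illustrated with a small sketch. It is not the fairscale, deepspeed or pytorch API: `make_micro_batches` is a hypothetical helper, nested lists stand in for tensors, and non-sliceable values such as flags are simply copied into every micro-batch, which is roughly what a PP API that accepts non-tensor arguments has to do.

```python
# Hypothetical sketch of micro-batching for pipeline parallelism.

def make_micro_batches(inputs, chunks):
    """Slice list-valued inputs along the batch dimension; copy everything else."""
    batch_sizes = {len(v) for v in inputs.values() if isinstance(v, list)}
    if len(batch_sizes) != 1:
        raise ValueError("all sliceable inputs must share one batch size")
    batch_size = batch_sizes.pop()
    if batch_size % chunks != 0:
        raise ValueError("batch size must be divisible by the number of chunks")
    step = batch_size // chunks
    micro_batches = []
    for start in range(0, batch_size, step):
        micro = {}
        for name, value in inputs.items():
            if isinstance(value, list):
                micro[name] = value[start:start + step]  # sliceable, tensor-like
            else:
                micro[name] = value                      # flag or None: replicated
        micro_batches.append(micro)
    return micro_batches


# Toy inputs: two tensor-like arguments and one boolean flag.
inputs = {
    "input_ids": [[101, 7], [101, 8], [101, 9], [101, 10]],
    "attention_mask": [[1, 1], [1, 1], [1, 1], [1, 1]],
    "use_cache": False,
}
micro_batches = make_micro_batches(inputs, chunks=2)
for i, micro in enumerate(micro_batches):
    print(i, micro)
```

Aggregates such as tuples of tuples do not fit either branch cleanly, which is the kind of input that makes the existing models hard to pipeline.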
I wasn't referring to a specific MPU implementation. Deepspeed has one too. It's basically the manager of all dimensions of parallelism. The only reason I mentioned it so that we consider the future PP dimension as we develop the manager.
Then we start with just that.
Yes, that's the whole point of MPU. DP doesn't even need to know about TP, it just sees gpu0 and gpu2 - it has no idea there are more GPUs in the pipe. Each parallel dimension typically hides its existence from the other dimensions, which keeps things simple. |
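The layout described here can be sketched in plain Python. This is only an illustration under assumed values (4 GPUs, TP size 2); `build_groups` is a hypothetical helper, not the API of Megatron-LM, DeepSpeed or transformers, and a real MPU would go on to create communication groups from these rank lists.

```python
# Hypothetical sketch of the rank bookkeeping an MPU performs.

def build_groups(world_size, tp_size):
    """Return (tensor-parallel groups, data-parallel groups) as lists of ranks."""
    if world_size % tp_size != 0:
        raise ValueError("world_size must be divisible by tp_size")
    # Neighbouring ranks hold the slices of one model replica.
    tp_groups = [
        list(range(start, start + tp_size)) for start in range(0, world_size, tp_size)
    ]
    # Ranks holding the same slice in different replicas average gradients together.
    dp_groups = [list(range(offset, world_size, tp_size)) for offset in range(tp_size)]
    return tp_groups, dp_groups


tp_groups, dp_groups = build_groups(world_size=4, tp_size=2)
print("TP groups:", tp_groups)
print("DP groups:", dp_groups)
```

In this assumed setup the data-parallel group containing gpu0 holds only gpu0 and gpu2, so DP never has to know that gpu1 and gpu3 exist.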
Your collaboration plan is very clear, @hyunwoongko. Thank you for your inspiration to share your work for the good of all! It's true that being part of a "bigger pie" will make your work accessible to a lot more users. Wrt step 2, you know that Deepspeed has a full TP implementation, except not in CUDA kernels - perhaps this can be utilized instead for Otherwise please ping or tag me when you need my input here or on the Deepspeed github. Looking forward to this inspiring collaboration, @hyunwoongko |
First of all, we need to discuss this collaborative process with @RezaYazdaniAminabadi. |
I'll review the code soon. Thank you. |
Here is the English version of the blog post! |
@stas00 Sorry for the delay on this work. We are also building a public large-scale model that can cover Asian languages. I've been very busy these days, so I haven't had much time to contribute to Hugging Face. I will work on it as soon as possible. |
Also pinging @siddk whose team also has been working on improving For context, while your team was on a summer break, @hyunwoongko implemented Parallelformers and we started discussing how to integrate their work, while planning integration of Deepspeed CUDA kernels for TP. So now that your team is getting back let's discuss how to best collaborate. |
Oh this is awesome, thanks @stas00 and nice to meet you @hyunwoongko. Let me get up to speed on this thread, but this looks like amazing work! |
@siddk Hello. Could you please explain so I can get the context? :) |
I will resume this work from this weekend. Since my company is so busy now, most of the open source work will probably be done on weekends. I will be working on DeepSpeed this week. I had an offline meeting with them and we are discussing how to combine our work. (Integration with Huggingface transformers will probably not take place soon, because it is steps 3 and 4.) |
It's really cool to see this collaboration in the pipeline! I'm not affiliated with any of the frameworks/organizations here at stake, but I do come from HF BigScience side of things where I've briefly discussed things with @stas00. If there's grunt work or anything else that has to be done, I'd be more than happy to contribute in ways that I can. |
@jaketae I already know you from the KoClip project. Nice to meet you. Your work would be of great help. :) @stas00 Currently, we need to talk more with the DeepSpeed team. I will first integrate the parallelformers features into DeepSpeed. However, what DeepSpeed and transformers currently want is slightly different, so we need to adjust for that.
|
As I commented in another issue: HF transformers wants both training and inference. It's just that we have a lot more users using the library for training. So there is definitely no misalignment between the two. Remember that Deepspeed already has PP, so they are just missing TP and inference. HF Transformers doesn't have those yet, hence the difference. (thanks to @hyunwoongko for correcting me that DS doesn't have TP) |
https://twitter.com/siddkaramcheti/status/1430195543301492744 |
@jaketae, the idea is to first pick one model and port it to TP and later PP. Then we will have to replicate this for all models (or at least models that will support this), so there will be a ton of work for quite a few people to contribute. |
I will close this issue. Let's discuss in #13690 |
We recently developed an open-source library called parallelformers (https://github.com/tunib-ai/parallelformers) and have a few questions, so we are opening an issue here.
Q. As a logo, we used an image that pays homage to Hugging Face. It is not exactly the same CI; it comes from Unicode. Will this be a problem?
Q. What do you think about collaboration? We can include model parallelization for all models in Hugging Face transformers.
The following is what I posted on Reddit to promote our open-source project.
Hello, I am writing to inform you about the release of Parallelformers (https://github.com/tunib-ai/parallelformers), a model parallelization library at TUNiB. Parallelformers is a toolkit that supports inference parallelism for 68 models in Huggingface Transformers with 1 line of code.
Previously, DeepSpeed-Inference was used as a parallelization toolkit for model inference, but it had several problems:
(1) It was impossible to deploy to a web server because of its process flow.
(2) It lacked integration with Huggingface Transformers, which has now become the de facto standard for natural language processing tools. (DeepSpeed-Inference only supports 3 models.)
(3) Since parallelization starts in the GPU state, all parameters of the model had to be put on the GPU before parallelization.
Parallelformers solved a number of problems in DeepSpeed-Inference. Using this toolkit internally, we were able to easily deploy a large model to our web server, reducing the cost of deployment by up to 3-5x. More detailed information and source code can be found on GitHub. Thanks !