Community Integration: Colossal-AI for Large AI Models #18624
If you have any difficulties or concerns, please let me know.
I haven't had a chance to read up on Colossal-AI yet. Why do you believe it's much better based on your research, @flozi00? I did notice that it suggests the integration of PatrickStar's functionality. CAI appears to be its own eco-system; I'm not sure how easy it'd be to integrate with ours.
In terms of code, it looks very similar to a normal PyTorch training loop. I already noticed some time ago that it was trending on paperswithcode for a while. The benchmarks look pretty nice at first glance, but they are also a bit confusing. In any case, I think it's not a bad idea to test alternatives to Deepspeed.
Thank you for sharing your insights, @flozi00! I read their paper and I'm not quite sure what type of integration is proposed here. Unlike Deepspeed, which is meant to be integrated with the user's code, CAI seems to be a standalone solution.

One of the biggest issues with any parallelism proposal (other than DDP) is that they all require rewriting the model's code, which with 100+ models and growing in our arsenal would be prohibitively expensive. Therefore we always welcome automated solutions like Deepspeed, which require no changes whatsoever to most models and sometimes a small tweak for some peculiar models.

It's definitely worth exploring all the different versions of TP (2/2.5/3D) mentioned in the paper, but we need this automated and not manually rewritten. The paper briefly mentions PP, but as we all know, this one definitely requires a complete rewrite of the model for most frameworks.

So again, let's ask a very concrete question: other than being part of the HF ecosystem, what is the vision for the proposed integration? We already have 2 trainer loop systems (HF Trainer and Accelerate) and we won't want to maintain a 3rd one. Do you need to inject something into the ...? Do you propose to rewrite the models to support ...?

Perhaps let's take one HF Transformers model of your choice and tell us what you would like to do with it to have it run on CAI? That would be more practical.

And specifically to your interest, @flozi00: yes, I hear you like the advanced memory utilization proposed in PatrickStar, and CAI claims to have integrated that functionality.

I hope my commentary was constructive; we are definitely open to good improvements to our tools. It's just that I'm wary of adding yet another tool unless a clear advantage and ease of integration can be shown.
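To make the "rewriting the model's code" concern concrete, here is a toy, framework-free sketch of column-parallel tensor parallelism for a single linear layer, in pure Python. All function names are illustrative and not any real Colossal-AI or Transformers API. The point is that the forward pass itself has to change to shard the weight and gather the partial outputs, which is exactly the kind of per-model surgery that doesn't scale to 100+ architectures:

```python
# Toy sketch (not a real API): column-parallel TP for one linear layer.
# Each "rank" holds a column shard of the weight; the forward pass must
# compute a partial output per shard and gather them back together.

def matmul(x, w):
    # x: list of rows; w: list of rows (in_features x out_features)
    return [[sum(xi * wij for xi, wij in zip(row, col))
             for col in zip(*w)] for row in x]

def split_columns(w, ranks):
    # Column-parallel sharding of the weight matrix across `ranks`.
    cols = list(zip(*w))
    per = len(cols) // ranks
    shards = [cols[r * per:(r + 1) * per] for r in range(ranks)]
    return [list(map(list, zip(*s))) for s in shards]

def column_parallel_forward(x, w, ranks):
    # Each rank computes its output shard; a simulated all-gather then
    # concatenates the shards along the feature dimension.
    outs = [matmul(x, shard) for shard in split_columns(w, ranks)]
    return [sum((o[i] for o in outs), []) for i in range(len(x))]

x = [[1.0, 2.0]]
w = [[1.0, 0.0, 2.0, 1.0],   # 2 x 4 weight matrix
     [0.0, 1.0, 1.0, 3.0]]
# Sharded forward must match the plain (unsharded) forward.
assert column_parallel_forward(x, w, ranks=2) == matmul(x, w)
```

Automating this transformation (rather than hand-editing each model's forward) is what an integration would need to deliver.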
Also, let's ping @hyunwoongko - Kevin, I know you have studied many frameworks while building https://github.com/tunib-ai/oslo - have you by chance researched Colossal-AI on your journey? If you did, would you kindly share a few insights if you have any? I know you were cherry-picking the best parts from many systems in addition to your own innovations.
I'm sorry to admit that I didn't think of the backwards compatibility; I totally forgot about that point, sorry. I focused mainly on the integration in the Trainer and did not consider the now very many architectures and weights. Maybe CAI has an idea to automate that? I have some ideas in mind, but those would be more part of CAI itself or third-party tools: finding JIT methods to convert the required model parts, instead of the HF integration.
No harm done. This is totally understandable: the HF Transformers eco-system has been becoming more and more complex, so it's often far from trivial to add yet another component to it. We warmly welcome solutions that can automate performance enhancements (like torchdynamo, see below).
PL is a training framework/loop; last I looked they didn't have a model library and were using Transformers, so they don't need to deal with modeling.
There is already work being done on that with torchdynamo/nvfuser. It's not fully stable yet, but it shows some impressive speed-ups (and lower memory usage) for converting normal PyTorch code to fused kernels. This is a different dimension from parallelism and advanced memory management systems, though. It's definitely not a replacement for parallelism: it can save 2x memory or provide a 2x speed-up, but that's far from enough for 100B+ models. Please see the HF integration details here:
Hi, we drafted a pull request which integrates Colossal-AI into Lightning. Here are examples and a benchmark: https://github.com/hpcaitech/ColossalAI-Pytorch-lightning. We have implemented ZeRO-DP with chunk-based memory management and heterogeneous memory management. I think this is not hard to integrate into HF. Besides, we are working on auto parallelism. I believe we will be able to use TP/PP without modifying the model in the future.
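For readers unfamiliar with the chunk-based approach mentioned above, here is a minimal, purely illustrative Python sketch of the idea behind it (all class and method names here are hypothetical, not the real Colossal-AI API): parameters are packed into fixed-size chunks, and a whole chunk moves between CPU and GPU tiers at once, amortizing transfer cost across neighboring parameters:

```python
# Illustrative sketch (hypothetical names, not the Colossal-AI API) of
# chunk-based memory management: parameters are packed first-fit into
# fixed-capacity chunks, and fetching one parameter moves its whole
# chunk to the "device" tier.

class Chunk:
    def __init__(self, capacity):
        self.capacity = capacity   # max number of elements in the chunk
        self.used = 0
        self.params = []           # (name, numel) pairs packed here
        self.tier = "host"         # "host" (CPU) or "device" (GPU)

    def can_fit(self, numel):
        return self.used + numel <= self.capacity

    def append(self, name, numel):
        self.params.append((name, numel))
        self.used += numel

class ChunkManager:
    def __init__(self, chunk_capacity):
        self.chunk_capacity = chunk_capacity
        self.chunks = []

    def register(self, name, numel):
        # First-fit packing: reuse the last open chunk when possible.
        if not self.chunks or not self.chunks[-1].can_fit(numel):
            self.chunks.append(Chunk(self.chunk_capacity))
        self.chunks[-1].append(name, numel)

    def fetch(self, name):
        # Moving one parameter to the device moves its whole chunk,
        # so neighboring parameters come along "for free".
        for chunk in self.chunks:
            if any(n == name for n, _ in chunk.params):
                chunk.tier = "device"
                return chunk
        raise KeyError(name)

mgr = ChunkManager(chunk_capacity=1024)
mgr.register("embed.weight", 768)
mgr.register("layer0.qkv", 512)   # 768+512 > 1024, so a new chunk opens
mgr.register("layer0.out", 256)   # fits in the second chunk
mgr.fetch("layer0.out")           # pulls layer0.qkv to device with it
```

The real systems (PatrickStar, CAI's heterogeneous memory manager) add runtime statistics to decide which chunks to evict, but the packing-and-migrating-whole-chunks principle is the core idea.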
OK, so at the moment you're proposing to integrate CAI for:
@sgugger, should this perhaps go straight into ...? (Sylvain is on vacation, so please let's wait a bit for him to be back and advise on how best to proceed.)
We'll probably need to duplicate the integration in the Trainer and Accelerate for now, since the Trainer does not depend on Accelerate.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Feature request
Dear Hugging Face Team,
My name is Yongbin Li. I am part of Colossal-AI Team.
Thanks for your previous invitation to Colossal-AI org to join Hugging Face. We are happy to share our founder's blog about Hugging Face.
We are thinking about further collaboration, e.g. integrating Colossal-AI into Hugging Face to help your community members use large AI models in a more efficient and easier manner.
For example, we can democratize its access to all your users in the same way as you did with DeepSpeed.
https://huggingface.co/docs/transformers/v4.21.0/en/main_classes/deepspeed
Motivation
We believe the democratization of large AI models is also very helpful for Hugging Face members. We would greatly appreciate it if we could build this integration with you to benefit both of our user communities.
Actually, we are working on similar integrations with Meta OPT (done), PyTorch Lightning (in progress), etc.
Your contribution
We can provide any help you need in this cooperation for free. Actually, we have reached a preliminary understanding with your team members Omar, Lysandre, and Julien via email ([email protected]) and look forward to your further reply.
Feel free to reach out to me on Hugging Face Discord. My username is billy2022. We can discuss more details with other colleagues in a private group.
Thank you very much.
Best regards,
Yongbin Li, Chief Marketing Officer, HPC-AI Tech