Patches from 6.1.1 to 6.1.2, watch for regressions #125

Closed · jeroen-mostert opened this issue Aug 4, 2024 · 7 comments
@jeroen-mostert (Contributor)

I've noticed at least one patch that was present in 6.1.1 but has disappeared in 6.1.2, even though the issue it fixed is still relevant and the patch should still be there, namely this one for AOTriton. This issue is just a general reminder that we should review whether regressions are happening from one release to the next, because patches are easy to lose when they have to be regenerated for new releases of the software but don't directly contribute to keeping the build unbroken on our test distros.

@lamikr (Owner) commented Aug 5, 2024

Thanks, I will cherry pick that one.

As an aside, I have already started merging some patches for the rocm-6.2 release, and composable kernel is not easy... With the initial set it built fast for gfx1030, but in reality it contained almost nothing, as the new makefiles had filtered almost everything out unless you build for MI* devices. MIOpen then failed to build because of the libraries that were consequently missing (device contraction library, etc.).
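One rough way to sanity-check a composable kernel build for this kind of silent filtering is to look at which device instance libraries actually came out of the build. A minimal sketch; the build directory path and the file name patterns are assumptions, not verified against this repo:

```bash
# Count the device instance objects produced by a CK build
# (path and naming pattern are assumptions; adjust to your setup):
find builddir/composable_kernel -name '*device*instance*' | wc -l

# The MIOpen failure above suggests checking specifically for the
# contraction instances:
find builddir/composable_kernel -iname '*contraction*'
```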

There is now a 90 min video on YouTube about composable kernels; I have been spending time listening to it.
https://www.youtube.com/watch?v=-732zELVbpU

@jeroen-mostert (Contributor, Author)

I'm honestly starting to think that the majority of the most time-consuming parts of our build are completely irrelevant for most users, because the repos in question only target business hardware. Even when they can be made to compile for consumer hardware, they don't actually speed up any software in the wild that could make meaningful use of those libraries, since AMD tends to only target specific scenarios.

Composable kernel is another huge time sink, so hearing that it too is "for the birds" is disheartening. There are plenty of theoretical avenues where things might get compiled for consumer hardware with some effort (like some of the stuff we've been doing with Tensile), but so far any speedup you do see there tends to come from people who specifically build their own software to target ROCm, not from any libraries AMD has come up with.

It might be nice to take stock and see what category things fall into:

1. targets business hardware and is not meaningful to port because consumer hardware lacks the features;
2. targets business hardware but could be made to compile for consumer hardware, just doesn't at the moment;
3. targets business hardware but also compiles for consumer hardware and provides features;
4. targets anything AMD and just works regardless.

Categories 3 and 4 should obviously be included in our repo; 2 should be masked off until we have those patches and confirm they actually do something in real software; and 1 should probably just go, since the gain for business users is minimal and the cost in compilation time, disk space and broken dreams ( :P ) for consumer users is considerable.

If we do want to keep these because we want to serve as the ultimate ROCm repository for anything, it is at least worth considering splitting things out into a separate binfo list that is gated to compile only if you're actually targeting the hardware. Alternatively, we could extend the binfos themselves with a "do not compile unless one of these arches is being targeted" field or guard.
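A hypothetical sketch of such a guard, assuming binfos remain plain shell files sourced by the build scripts; the variable names (BINFO_REQUIRED_GPU_ARCHES, SDK_GPU_ARCHES) and the helper function are inventions for illustration, not existing babs features:

```bash
# In the binfo (hypothetical field): only build when one of these is targeted.
BINFO_REQUIRED_GPU_ARCHES="gfx90a gfx940 gfx941 gfx942"

# In the build driver (hypothetical helper): skip the package when no
# requested arch matches the binfo's requirement list.
binfo_arch_supported() {
    local requested="$1" required="$2" arch
    [ -z "$required" ] && return 0   # no guard -> always build
    for arch in $requested; do
        case " $required " in (*" $arch "*) return 0 ;; esac
    done
    return 1
}

if ! binfo_arch_supported "$SDK_GPU_ARCHES" "$BINFO_REQUIRED_GPU_ARCHES"; then
    echo "skipping: no targeted gpu arch needs this package"
    return 0
fi
```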

@lamikr (Owner) commented Aug 5, 2024

I think that a big part of the value of this project is that it allows building the whole stack, and therefore makes it much easier to catch problems and then at least have a chance to fix them by adding support for missing parts.
And there can also be "regular" users, for example in universities and companies, who have access to these pro cards; even for them it's valuable to be able to build the whole rocm stack so they can customize and maintain it.

Another aspect is that composable kernel and triton are two different approaches to the same problem. Both claim they can have common higher-level abstraction code that can then easily be built for multiple different gpu architectures. For example, the flash attention required by xformers and pytorch can be implemented on either composable kernel or triton. (And there are amd implementations of both, some even in random branches for navi cards.) The problem is just that those random bits are at the moment not integrated into anything. That anything could be the rocm sdk builder.

Composable kernel is nowadays needed by many packages like MIOpen, onnxruntime and xformers, so we need to have it enabled even when targeting consumer level gpus. What is needed is better navi card specific support in both triton and composable kernel. It kind of already works, otherwise the pytorch 2.4.0 build would already fail.

At the moment I am trying to learn both triton and composable kernel myself, so I could write the missing navi21/navi31 parts if needed. (And as much navi10 support as possible, although many people will most probably upgrade those cards to newer gpus.) It's not clear to me whether both of these approaches are needed in the long term, or whether composable kernel will just vanish if it turns out that it's easier to implement those parts with triton. If composable kernel claims to be a framework for easily supporting multiple different hardware targets but fails to do that on navi cards, then there is a clear gap between the claims and reality.

@lamikr (Owner) commented Aug 5, 2024

And jumping to another issue... Do you have time to check the ffmpeg7 patch for pytorch audio in the wip/rocm_sdk_builder_612_pytorch240 branch?

You do not need to build everything; it's enough to rebuild amdsmi, pytorch, pytorch vision and pytorch audio.
After the -co/-ap steps, you may also need to do the following for src/pytorch:

```bash
cd src_project/pytorch
# Wipe the working tree, then restore tracked files and submodules:
rm -rf *
git reset --hard
git submodule update --init --recursive
```

@jeroen-mostert (Contributor, Author)

Updating the patch is trivial; instances of c10::optional have been replaced with std::optional. These are the only changes to the C++ parts; there are no functional changes. Once they are also changed in the patch (`sed -i 's/c10::optional/std::optional/g' 0004-add-support-for-FFmpeg-7.patch`), it applies and builds as before.
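For completeness, a minimal sketch of the rename plus a dry run confirming the updated patch still applies; it assumes you run it from the pytorch audio source checkout with the patch file alongside:

```bash
# Rewrite every c10::optional reference in the patch to std::optional:
sed -i 's/c10::optional/std::optional/g' 0004-add-support-for-FFmpeg-7.patch
# Dry run: verify the patch would still apply cleanly, without touching files:
git apply --check 0004-add-support-for-FFmpeg-7.patch
```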

If you want me to test this and submit it as a pull request, it will take longer, as I'm actively using PyTorch in a lengthy operation and so I can't have babs replace it with a new build. :P

@jeroen-mostert (Contributor, Author)

> I think that a big part of the value of this project is that it allows building the whole stack, and therefore makes it much easier to catch problems and then at least have a chance to fix them by adding support for missing parts. And there can also be "regular" users, for example in universities and companies, who have access to these pro cards; even for them it's valuable to be able to build the whole rocm stack so they can customize and maintain it.

It would be nice if we could maybe attract some of those users to actually build it and submit back changes. I am of course extremely biased from my POV as a consumer card user with no access to these fancy gadgets who's also contributing to this project, and hence building it much more than a regular user would. Those build times sting. :P

> Another aspect is that composable kernel and triton are two different approaches to the same problem. Both claim they can have common higher-level abstraction code that can then easily be built for multiple different gpu architectures. For example, the flash attention required by xformers and pytorch can be implemented on either composable kernel or triton. (And there are amd implementations of both, some even in random branches for navi cards.) The problem is just that those random bits are at the moment not integrated into anything. That anything could be the rocm sdk builder.

A bigger issue is that the attention implementations either of these supposedly offer tend to be unsupported, unused or slow when actual projects try to use them. I can still see PyTorch complain that "Torch was not compiled with flash attention", and apparently even if you get past that, the performance suffers (pytorch/pytorch#112997). Of course these things can be improved in theory; it is however frustrating that the only thing I've seen from AMD is articles explaining how to get some of this working using very specific builds from AMD themselves, and then only on business hardware. None of this is directly related to this repo and none of it is your fault, I just wanted to vent a bit. :P
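As a concrete way to see which attention backends a given build actually provides, here is a small probe sketch; it assumes a pytorch 2.3+ torch.nn.attention API, a working ROCm device (exposed as "cuda"), and arbitrary tensor shapes:

```bash
python - <<'EOF'
# Try each scaled-dot-product-attention backend in isolation and report
# whether this torch build can actually run it on the current gpu.
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel

q = k = v = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
for backend in (SDPBackend.FLASH_ATTENTION,
                SDPBackend.EFFICIENT_ATTENTION,
                SDPBackend.MATH):
    try:
        with sdpa_kernel(backend):
            torch.nn.functional.scaled_dot_product_attention(q, k, v)
        print(backend, "OK")
    except Exception as exc:
        print(backend, "unavailable:", exc)
EOF
```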

