cross-compiled CUDA builds running out of disk space #1114
Comments
However, it does somewhat seem related to PCHs, judging by what the failure looks like.
This is now permanently blowing up our cross-compiled CUDA builds (both aarch & PPC) on 12.x & 11.x. On 10.x at least, the build passes (with the ~same fixes as mentioned in the OP, in particular with google-cloud-cpp disabled).
You are looking at arrow C++ sources, but the error is in pyarrow.
My hunch is that something has changed about the Azure images, which causes the amount of stuff included in them to increase (not exactly sure what changed).
- Have seen this in a couple other cross-compilation CUDA builds, though I think that is coincidental, as there is simply more stuff being downloaded in those cases.
- Have seen disk space issues in at least one job that doesn't do any cross-compilation (though it is CUDA related).
- Have poked around a little bit with
I was just collecting potentially related information; that particular option is already off by default anyway, so it wasn't a serious candidate.
Yeah, the cross-compilation infra for CUDA 11 needs to download and unpack a bunch of artefacts (see conda-forge/conda-forge-ci-setup-feedstock#210). Would it make sense to try to move these builds to CUDA 12? Having any builds restricted to CUDA >=12 would still be better than having no builds at all.
Giving this a shot in #1120.
Sure, that seems like a reasonable approach 👍 Happy to look over things there if you need another pair of eyes 🙂
If you look at pyarrow sources, you'll see that it's not 'already off by default anyway'.
Can you be more specific about what you're referring to? I gave a direct link to an option that's off by default (I didn't claim it applied to pyarrow either...). In pyarrow, I don't find anything using the substring
Exactly. Turn that off with a patch, and this issue will probably go away.
That falls under the category "not obvious to me" - I can't tell if things are still expected to work without this (given that there's no option to toggle), and I'm not in the habit of patching out things I don't understand (for example, I'm confused why headers -- something pretty lightweight -- would blow through the disk space). But I'm happy to try it, thanks for the pointer.
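(As a generic way to check how much precompiled headers actually take up once a build tree exists, one could sum the size of the `.gch`/`.pch` artifacts. A minimal sketch using standard tools, with `build` as a placeholder for the actual build directory:)

```
$ # rough total of precompiled-header artifacts (du -c prints a cumulative total)
$ find build -type f \( -name '*.gch' -o -name '*.pch' \) -exec du -ch {} + | tail -n1
```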
Precompiled headers are not lightweight. They are heavy.

```
$ cat pch.h
#include <stdio.h>
$ g++ pch.h -o pch.h.gch
$ file pch.h.gch
pch.h.gch: GCC precompiled header (version 014) for C++
$ ls -alh pch.h.gch
-rw-rw-r-- 1 isuru isuru 2.2M Jul 20 14:49 pch.h.gch
```
Maybe we should ask someone from the Arrow team to chime in?
Everything is relative of course, but I don't think 2.2MB will be the reason for us running out of disk space on the agent.

Sure. I think it's more the "fault" of our infra rather than arrow itself, but removing pyarrow's precompiled headers would be good to check (i.e. its viability and potential impact). Hoping you could weigh in @kou @pitrou @jorisvandenbossche @assignUser
That's just a simple C header generating 2.2MB. Template-heavy C++ headers can go up to several GBs.
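(To give a rough sense of scale, along the same lines as the example above; this is only an illustrative sketch, and actual sizes depend heavily on the compiler and standard-library version:)

```
$ cat big_pch.h
#include <vector>
#include <string>
#include <algorithm>
#include <regex>
$ g++ -x c++-header big_pch.h -o big_pch.h.gch
$ ls -alh big_pch.h.gch
$ # expect something substantially larger than the 2.2M stdio example above;
$ # a PCH that pulls in big template libraries grows by orders of magnitude.
```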
OK, thanks, finally I can see why this would be related. I still don't know why it would blow up so hard, but that's something I can investigate later.
I can add to that suspicion, as some of the space-heavy Arrow doc builds we run on Azure have started failing due to lack of space recently, and we don't understand why. Regarding the PCH: my understanding is that PCHs are useful to speed up build times on repeated re-builds (e.g. local development), which is not really the case here, iirc the CI setup (matrix build, so each job only builds once?). So it should be fine to patch that out, but there should probably also be an arrow issue to add an option for PCH in pyarrow? @jorisvandenbossche
Thanks for the info @assignUser! For now, even patching out the PCH didn't work (see #1122); we've now also removed some unnecessary caching in our images, to no avail. I'm now looking at trimming some more fat in our cross-CUDA setup, which I noticed is not deleting the CUDA
Indeed, thanks for the feedback! 🙏
Here are some ideas of things we might remove from the Azure images (conda-forge/conda-smithy#1747).
I don't know what change caused this (perhaps something in the CUDA setup...), but since about a month ago, cross-compiling CUDA consistently blows through the disk space of the Azure workers, failing the job in an often un-restartable way.
I've tried fixing this in various ways (#1075, #1081, 7c26712, a8ca8f7, 555a42c). The problem exists on both ppc & aarch; for ppc at least, the various fixes seem to have mostly settled things, but for aarch it's still failing 9 times out of 10.
(Note: the only reason I disabled aws-sdk-cpp is that jobs started failing again after migrating to a new version that had some more features enabled and a footprint around 40MB; this is being tackled in conda-forge/google-cloud-cpp-feedstock#141.)
CC @conda-forge/cuda-compiler @jakirkham @isuruf
PS. By chance I've seen that qt also has disk space problems, and worked around this by disabling the use of precompiled headers. Arrow has an option `ARROW_USE_PRECOMPILED_HEADERS`, but it's already off by default.
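(For reference, a minimal sketch of how that option appears at configure time; the `cpp/` source path and the plain CMake invocation are assumptions about a vanilla Arrow checkout rather than this feedstock's build scripts, and the cache line shown is illustrative:)

```
$ cmake -S cpp -B build -DARROW_USE_PRECOMPILED_HEADERS=OFF
$ grep ARROW_USE_PRECOMPILED_HEADERS build/CMakeCache.txt
ARROW_USE_PRECOMPILED_HEADERS:BOOL=OFF
```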