Hydra: nixos/release-20.03 and unstable fails to evaluate #79907
Comments
Over the last few weeks it has felt like we're slowly making the big nixos eval too expensive (again). Maybe not just nixos, as I've also seen an increase in out-of-memory failures in jobs like |
Noticed this as well, can't open ZHF until there's an eval on the jobset. |
Eelco's thinking is that the growth of NixOS tests is causing memory pressure problems. Each VM in the tests adds a few hundred MB of RAM consumption for hydra's evaluator. |
It feels bad to be "within 5 tests" of being unable to move forward. :( |
To clarify, @edolstra's short-term suggestion is to remove some of the tests. For example, those key map tests were commented out for a very long time. It would be sad to drop them again, but it may be the best short-term solution. Long-term, there is a branch for a more precise GC, and possibly some optimisation work that could be done on how NixOS is evaluated, but I don't know if either of these longer-term things is possible today. That said, I'm 100% not the right person for this problem, and possibly @LnL7, @samueldr, @fpletz, or @Ma27 have advice on how to tune hydra's evaluator. |
@grahamc does it just run out of memory, or why does it fail to evaluate? |
@LnL7 could we bisect around that? |
This issue has been mentioned on NixOS Discourse. There might be relevant details there: https://discourse.nixos.org/t/nixos-20-03-feature-freeze/5655/32 |
Does anybody know why this only occurs for 20.03 and not trunk-combined? Evaluation for those should be equivalent (except for stableBranch, but that is/should be purely metadata). |
I've seen killed trunk-combined tasks earlier today while trying to trigger an eval. |
If that's the case, we might just want to remove that reference - I don't really see a reason why python should become part of glibc's runtime closure. |
I'm not sure how the closure sizes are relevant to this thread, but I can't see a significant increase of (runtime) closure size for |
stdenv size didn't change much:
and python is not in the runtime closure:
|
@vcunat It could be something totally different, but given that nixos instances will evaluate pkgs multiple times it's something that increases evaluation for each test. |
I did notice that the hydra jobsets for "trunk" now take over 100 seconds to evaluate, where they used to be significantly lower when I first started viewing hydra >6 months ago. |
The evaluator dies with |
No, I believe there are no such connections. |
It's a temporary measure until we have better ways. See #79907. (Not a real revert, as the comment wouldn't make sense, etc.)
Ouch, having glibc depend on python is really unfortunate. |
It was upstream decision to use python in the build process (build-time only dependency). I don't think we can do much about that. EDIT: using some minimal python could be nice, though. |
OK, I submitted #80112, but I still can't see how it's relevant to this thread. |
Based on the gc stats from nix, the memory needed to evaluate e.g. hello increased from 26 MB to 29 MB with the glibc update (this has now doubled compared to 18.03, btw). This indeed isn't a big deal, since it's a flat cost per architecture. However, that's not the case for nixos instances, since each test imports its own instance of nixpkgs. I can't evaluate everything on my machine with the current settings, but evaluating just the tests seems to use between 600 MB and 1.5 GB more before reverting that commit. With the way evaluation currently works, that's a problem if this bumps up the memory usage enough to require a larger heap. I don't know how much memory the hydra evaluator has available, but with |
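As a rough consistency check on the figures in the comment above, the per-instance increase can be compared with the total increase seen for the tests. This is only a back-of-the-envelope sketch: it assumes the extra cost per nixpkgs instance inside a test equals the increase measured for `hello`, that the cost scales linearly, and that 1.5 GB is taken as 1500 MB. The symbols $N$ (number of nixpkgs instances evaluated across the tests) and $\Delta m$ (extra memory per instance) are introduced here for illustration and do not appear in the thread.

```latex
\[
\Delta m \approx 29\,\text{MB} - 26\,\text{MB} = 3\,\text{MB},
\qquad
\Delta M_{\text{tests}} \approx N \cdot \Delta m
\]
\[
N \approx \frac{600\,\text{MB}}{3\,\text{MB}} = 200
\quad\text{up to}\quad
N \approx \frac{1500\,\text{MB}}{3\,\text{MB}} = 500
\]
```

Under those assumptions, a few hundred separate nixpkgs instances would be enough to turn a 3 MB increase into the reported 600 MB to 1.5 GB, which is why a change that is negligible per architecture can still matter for the test jobs.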
If I look correctly, using |
Yeah, I'm not sure there's a good solution for this other than trying to reduce the memory "enough" without more fundamental changes. I took a quick look at the evaluation for tests, this probably isn't the right place to change and I think it would break tests that use overlays as well as multiple architectures. But something similar might work to reduce the overhead for tests quite significantly.
diff --git a/nixos/lib/build-vms.nix b/nixos/lib/build-vms.nix
index 1bad63b9194..8da2504bea9 100644
--- a/nixos/lib/build-vms.nix
+++ b/nixos/lib/build-vms.nix
@@ -36,6 +36,7 @@ rec {
   baseModules = (import ../modules/module-list.nix) ++
     [ ../modules/virtualisation/qemu-vm.nix
       ../modules/testing/test-instrumentation.nix # !!! should only get added for automated test runs
+      { key = "nixpkgs-pkgs"; nixpkgs.pkgs = pkgs; }
       { key = "no-manual"; documentation.nixos.enable = false; }
       { key = "qemu"; system.build.qemu = qemu; }
       { key = "nodes"; _module.args.nodes = nodes; } |
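The idea behind the patch above can be illustrated outside the test framework. The following is a minimal sketch, not the code used by `build-vms.nix`: it assumes `<nixpkgs>` is on `NIX_PATH`, and the helper `evalNode`, the binding `sharedPkgs`, and the host names are hypothetical names chosen for illustration. It contrasts a NixOS evaluation that lets the `nixpkgs` module import its own package set with one that reuses an already-imported set through the `nixpkgs.pkgs` option.

```nix
# Sketch only: contrasts a per-node nixpkgs import with one shared instance.
# `evalNode`, `sharedPkgs`, and the host names are hypothetical.
let
  system = "x86_64-linux";

  # Import nixpkgs once; this value can be reused by any number of nodes.
  sharedPkgs = import <nixpkgs> { inherit system; };

  # Evaluate a NixOS configuration from a list of modules.
  evalNode = modules:
    import <nixpkgs/nixos/lib/eval-config.nix> {
      inherit system modules;
    };

  # Default behaviour: the nixpkgs module imports a package set of its own,
  # so this node pays the nixpkgs evaluation cost again.
  separate = evalNode [
    { networking.hostName = "node-a"; }
  ];

  # With `nixpkgs.pkgs` set, the node uses `sharedPkgs` instead of
  # importing a fresh package set.
  shared = evalNode [
    { nixpkgs.pkgs = sharedPkgs; }
    { networking.hostName = "node-b"; }
  ];
in
{
  # Forcing a package name from each node triggers the respective
  # package-set evaluation.
  separateHello = separate.pkgs.hello.name;
  sharedHello = shared.pkgs.hello.name;
}
```

As the comment above notes, fixing `pkgs` this way would likely ignore per-node overlays and a non-default `system`, so it is a sketch of the memory trade-off rather than a drop-in change. Evaluating the file with `nix-instantiate --eval --strict` while `NIX_SHOW_STATS=1` is set should make it possible to compare heap usage with and without the `nixpkgs.pkgs` line.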
My reading of that part is that The idea for VM tests seems intriguing. Overlays appear considered at a quick glance. |
I tried your patch with evaluation of just a pair of tests at once, and it decreased |
Yeah, I linked the wrong thing.
That looks promising; threading through pkgs for the correct system, instead of just pkgs (which is always x86_64-linux), to buildVM might be an option then. I won't have time to look into this further for a few days, however. |
I don't know these parts of code well, but I looked around and I still can't see any problem with that patch. I tried on Hydra, but it's still getting killed: https://hydra.nixos.org/jobset/nixos/nixos-test-expensive-eval (2/2 eval attempts killed) |
When I restricted it to just Therefore I still expect that patch helped significantly; I'd still check diff in test failures before using it for real. |
It looks like Eelco has whipped up a miracle and got evaluations passing, and in less time too. |
Bought a better server? :-) In any case, it will be nice to know how he managed it, as it's a never-ending problem. EDIT: I suspect it was some kind of cheating, as we no longer have the aggregate For long-term solutions of RAM consumption I have high hopes for NixOS/hydra#715 |
oof, 20mins for an eval. That's rough |
That seems quite a normal number IIRC. (for our big jobsets like trunk-combined) |
OK, let me ask explicitly about that miracle: how are channels going to work when we have no tested job anymore? Perhaps I just don't understand the intentions. |
Great ❤️ I pushed 20.03 backports. I believe the issue is fixed and shouldn't re-appear anytime soon. Possible TODOs:
|
No good, even the small channels are blocked now: NixOS/hydra#715 (comment) |
Resolved and today all channels even got updated. |
Describe the bug
The nixos/release-20.03 jobset fails to evaluate:
I've tried several times to trigger an evaluation, yet every time it fails.
cc @Disasm @worldofpeace @grahamc @vcunat