
Hydra: nixos/release-20.03 and unstable fails to evaluate #79907

Closed
FRidh opened this issue Feb 12, 2020 · 45 comments
Labels: 0.kind: bug (Something is broken), 0.kind: regression (Something that worked before no longer works), 1.severity: blocker (This is preventing another PR or issue from being completed), 1.severity: channel blocker (Blocks a channel)

FRidh (Member) commented Feb 12, 2020

Describe the bug
The nixos/release-20.03 jobset fails to evaluate:

hydra-eval-jobs returned signal 9:
(no output)

I've tried several times to trigger an evaluation, yet every time it fails.

cc @Disasm @worldofpeace @grahamc @vcunat

@FRidh FRidh added 0.kind: bug Something is broken 1.severity: channel blocker Blocks a channel labels Feb 12, 2020
@FRidh FRidh added this to the 20.03 milestone Feb 12, 2020
vcunat (Member) commented Feb 12, 2020

The last few weeks felt like we're slowly making the big nixos eval too expensive (again). Maybe not just nixos, as I've seen an increase in out-of-memory failures also in jobs like tarball, but perhaps it was just a feeling, as I see no significant increase in these graphs: https://hydra.nixos.org/job/nixpkgs/trunk/metrics#tabs-charts

Disasm commented Feb 12, 2020

cc @disassembler

worldofpeace (Contributor)

I noticed this as well; we can't open ZHF until there's an eval on the jobset.

@worldofpeace worldofpeace pinned this issue Feb 12, 2020
grahamc (Member) commented Feb 12, 2020

The thinking from Eelco is the growth of NixOS tests is causing memory pressure problems. Each VM in the tests adds a few hundred MB of RAM consumption for hydra's evaluator.

worldofpeace (Contributor) commented Feb 12, 2020

These got added in 7a625e7 and bf49181.

grahamc (Member) commented Feb 12, 2020

It feels bad to be "within 5 tests" of being unable to move forward. :(

grahamc (Member) commented Feb 12, 2020

To clarify, @edolstra's short-term suggestion is to remove some of the tests. For example, those key map tests were commented out for a very long time. It would be sad to drop them again, but it may be the best short-term solution. Long-term, there is a branch for a more precise GC, and possibly some optimisation work which could be done in how NixOS is evaluated.

But I don't know if either of these longer-term options is possible today.

That said, I'm 100% not the right person for this problem, and possibly @LnL7, @samueldr, @fpletz, or @Ma27 have advice on how to tune hydra's evaluator.

flokli (Contributor) commented Feb 12, 2020

@grahamc does it just run out of memory, or why does it fail to evaluate?

LnL7 (Member) commented Feb 12, 2020

This was for different reasons, but I've been tracking the stdenv requisite size for quite a while now. It could be totally unrelated, but that size had a rather large jump recently.

[chart: linux-requisites (stdenv requisite size over time)]

Update: this was between fa74455 and d453c2f. Most likely the libidn2 change, at first glance.

flokli (Contributor) commented Feb 12, 2020

@LnL7 could we bisect around that?

nixos-discourse (bot)

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/nixos-20-03-feature-freeze/5655/32

LnL7 (Member) commented Feb 12, 2020

Does anybody know why this only occurs for 20.03 and not trunk-combined? Evaluation for those should be equivalent (except for stableBranch, but that is/should be purely metadata).

andir (Member) commented Feb 12, 2020

I've seen killed trunk-combined tasks earlier today while trying to trigger an eval.

LnL7 (Member) commented Feb 12, 2020

@flokli In 447edaa it looks like python (and not a minimal build) was introduced into the stdenv. A minimal python would bring it down from ~270 to ~240 MB.

flokli (Contributor) commented Feb 12, 2020

If that's the case, we might just want to remove that reference; I don't really see a reason why python should become part of glibc's runtime closure.

vcunat (Member) commented Feb 12, 2020

I'm not sure how the closure sizes are relevant to this thread, but I can't see a significant increase in the (runtime) closure size of the stdenv output path on x86_64-linux (and python is not there).

jonringer (Contributor) commented Feb 12, 2020

stdenv size didn't change much:

[13:37:37] jon@jon-workstation ~/projects/nixpkgs (master)
$ nix path-info -Sh ./result
/nix/store/5gc1hyqbxwfwcw7l1bs7gy6rw9zbnc09-stdenv-linux	 231.6M
[13:39:35] jon@jon-workstation ~/projects/nixpkgs (release-19.09)
$ nix path-info -Sh ./result
/nix/store/qghrkvk86f9llfkcr1bxsypqbw1a4qmw-stdenv-linux	 224.4M

and python is not in the runtime closure:

[13:40:47] jon@jon-workstation ~/projects/nixpkgs (master)
$ nix-store -q --tree ./result | grep python
[13:40:58] jon@jon-workstation ~/projects/nixpkgs (master)

LnL7 (Member) commented Feb 12, 2020

@vcunat It could be something totally different, but given that nixos instances evaluate pkgs multiple times, it's something that increases evaluation cost for each test.

jonringer (Contributor) commented Feb 12, 2020

I did notice that the hydra jobsets for "trunk" now take over 100 seconds to evaluate, where they used to be significantly lower when I first started watching hydra more than 6 months ago.

FRidh (Member, author) commented Feb 13, 2020

The evaluator dies with hydra-eval-jobs returned signal 9, but random builds also fail with signal 9. Would the evaluator kill remote jobs when it runs out of memory? Or could those be builds that happen to run on the evaluator?
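[Editor's note] For context on that "signal 9": it is SIGKILL, which is the signal the kernel OOM killer sends when a host exhausts memory, so processes killed this way get no chance to report anything (hence the "(no output)" above). A minimal sanity check of the numbering (the journalctl command in the comment is a typical way to confirm an OOM kill on the host, not something taken from this thread):

```python
import signal

# "returned signal 9" means the process was SIGKILLed -- the signal the
# kernel OOM killer uses. Confirm the numbering on Linux:
print(signal.SIGKILL == 9)  # prints True on Linux

# On the evaluator host, one would typically confirm an OOM kill with
# the kernel log, e.g.:
#   journalctl -k | grep -i 'out of memory'
```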

vcunat (Member) commented Feb 13, 2020

No, I believe there are no such connections.

@worldofpeace worldofpeace added the 1.severity: blocker This is preventing another PR or issue from being completed label Feb 13, 2020
vcunat added a commit that referenced this issue Feb 13, 2020
It's a temporary measure until we have better ways.  See #79907.
(Not a real revert, as the comment wouldn't make sense, etc.)
@vcunat vcunat mentioned this issue Feb 13, 2020
edolstra (Member)

Ouch, having glibc depend on python is really unfortunate.

vcunat (Member) commented Feb 13, 2020

It was an upstream decision to use python in the build process (a build-time-only dependency). I don't think we can do much about that. EDIT: using some minimal python could be nice, though.

flokli (Contributor) commented Feb 14, 2020

@vcunat you could probably switch that occurrence to python3Minimal, introduced in #66762, which should have a smaller build and runtime closure, if you don't rely on things like libreadline or ssl support.

vcunat (Member) commented Feb 14, 2020

OK, I submitted #80112, but I still can't see how it's relevant to this thread.

LnL7 (Member) commented Feb 14, 2020

Based on the gc stats from nix, the memory needed to evaluate e.g. hello increased from 26 MB to 29 MB with the glibc update (this has now doubled compared to 18.03, btw). By itself this isn't a big deal, since it's a flat cost per architecture. However, that's not the case for nixos instances, since each test imports its own instance of nixpkgs.
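[Editor's note] A rough back-of-the-envelope check shows these numbers are mutually consistent; the test count below is an assumption for illustration (nixpkgs had on the order of a few hundred NixOS tests at the time), not a figure from this thread:

```python
# Per-evaluation overhead from the glibc update, per LnL7's gc stats:
per_eval_increase_mb = 29 - 26  # ~3 MB

# Assumed test count (order of magnitude only). Each NixOS test imports
# its own instance of nixpkgs, so the per-eval overhead is paid per test.
num_tests = 300

total_increase_mb = per_eval_increase_mb * num_tests
print(total_increase_mb)  # 900 -- inside the observed 600 MB to 1.5 GB range
```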

I can't evaluate everything on my machine with the current settings, but evaluating just the tests seems to use between 600 MB and 1.5 GB more before reverting that commit. With the way evaluation currently works, that's a problem if it bumps up the memory usage enough to require a larger heap.

I don't know how much memory the hydra evaluator has available, but with GC_INITIAL_HEAP_SIZE=20G both 20.03 and older releases evaluated without issues. The larger heap size does result in higher average memory usage, however, which might be a problem for concurrent evaluations.
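[Editor's note] The workaround above amounts to an environment setting: GC_INITIAL_HEAP_SIZE is read by the Boehm garbage collector that Nix links against, and sizing the heap up front avoids repeated collect-and-grow cycles during a large evaluation. A sketch (the actual hydra-eval-jobs invocation and its arguments are elided):

```shell
# Give the Boehm collector a 20 GB initial heap so the evaluator does not
# repeatedly trigger collections while growing the heap mid-evaluation.
export GC_INITIAL_HEAP_SIZE=20G

# Then run the evaluator under that environment, e.g. (arguments elided):
#   hydra-eval-jobs ...
echo "GC_INITIAL_HEAP_SIZE=$GC_INITIAL_HEAP_SIZE"
```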

vcunat (Member) commented Feb 14, 2020

If I read it correctly, using python3Minimal recovers only a small fraction of this increase.

LnL7 (Member) commented Feb 14, 2020

Yeah, I'm not sure there's a good solution for this, other than trying to reduce the memory usage "enough", short of more fundamental changes.

I took a quick look at the evaluation for tests. This probably isn't the right place to change, and I think it would break tests that use overlays as well as multiple architectures. But something similar might work to reduce the overhead for tests quite significantly.

diff --git a/nixos/lib/build-vms.nix b/nixos/lib/build-vms.nix
index 1bad63b9194..8da2504bea9 100644
--- a/nixos/lib/build-vms.nix
+++ b/nixos/lib/build-vms.nix
@@ -36,6 +36,7 @@ rec {
       baseModules =  (import ../modules/module-list.nix) ++
         [ ../modules/virtualisation/qemu-vm.nix
           ../modules/testing/test-instrumentation.nix # !!! should only get added for automated test runs
+          { key = "nixpkgs-pkgs"; nixpkgs.pkgs = pkgs; }
           { key = "no-manual"; documentation.nixos.enable = false; }
           { key = "qemu"; system.build.qemu = qemu; }
           { key = "nodes"; _module.args.nodes = nodes; }
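[Editor's note] For readers unfamiliar with the option being injected: `nixpkgs.pkgs` makes a NixOS configuration reuse an already-evaluated package set instead of importing nixpkgs from scratch, which is where the memory saving comes from. Roughly (a hand-written illustration, not code from nixpkgs):

```nix
# Without nixpkgs.pkgs, each test node evaluates something like
#   pkgs = import <nixpkgs> { inherit system; config = ...; }
# i.e. a fresh copy of the whole package set per node, per test.
#
# With the patch, each node instead receives:
{ key = "nixpkgs-pkgs"; nixpkgs.pkgs = pkgs; }
# so all nodes share the single `pkgs` the test framework already
# evaluated -- at the cost of ignoring per-node overlays and nixpkgs
# config, which is the breakage concern raised above.
```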

vcunat (Member) commented Feb 14, 2020

> However that's not the case for nixos instances, since each test imports its own instance of nixpkgs.

My reading of that part is that pkgs is passed through and not re-imported.

The idea for VM tests seems intriguing. Overlays appear to be considered, at a quick glance.

vcunat (Member) commented Feb 14, 2020

I tried your patch with evaluation of just a pair of tests at once, and it decreased gc.totalBytes by ~22%.
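[Editor's note] For anyone reproducing this kind of measurement: gc.totalBytes is among the evaluator statistics Nix prints when run with NIX_SHOW_STATS=1, and the percentage is just the relative change between two runs. Illustrative arithmetic with invented example values (not measurements from this thread):

```python
# Invented values standing in for gc.totalBytes from two NIX_SHOW_STATS
# runs, before and after applying the patch:
total_bytes_before = 10_000_000_000
total_bytes_after = 7_800_000_000

# Relative decrease, as a percentage:
decrease_pct = (1 - total_bytes_after / total_bytes_before) * 100
print(round(decrease_pct))  # 22
```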

LnL7 (Member) commented Feb 14, 2020

Yeah, I linked the wrong thing.

> Overlays appear considered at a quick glance.

That looks promising; threading through pkgs for the correct system, instead of just pkgs (which is always x86_64-linux), to build-vms might be an option then. I won't have time to look into this further for a few days, however.

vcunat (Member) commented Feb 14, 2020

I don't know these parts of the code well, but I looked around and I still can't see any problem with that patch. I tried it on Hydra, but it's still getting killed: https://hydra.nixos.org/jobset/nixos/nixos-test-expensive-eval (2/2 eval attempts killed)

@grahamc grahamc changed the title Hydra: nixos/release-20.03 fails to evaluate Hydra: nixos/release-20.03 and unstable fails to evaluate Feb 14, 2020
@bjornfor bjornfor added the 0.kind: regression Something that worked before working no longer label Feb 15, 2020
vcunat (Member) commented Feb 15, 2020

When I restricted it to just x86_64-linux, it succeeded on the second attempt. I'm hopeful about using this approach for now. Note that 20.03 was also created just for x86_64-linux and couldn't get an evaluation even after cutting some tests in ceb90b0... at least until a while ago (not sure what's changed).

Therefore I still expect that the patch helped significantly; I'd still check the diff in test failures before using it for real.

grahamc (Member) commented Feb 15, 2020

It looks like Eelco has whipped up a miracle and got evaluations passing, and in less time too.

vcunat (Member) commented Feb 15, 2020

Bought a better server? :-) In any case, it will be nice to know how he managed it, as this is a never-ending problem. EDIT: I suspect it was some kind of cheating, as we no longer have the aggregate tested job, in either trunk-combined or release-20.03.

For long-term solutions of RAM consumption I have high hopes for NixOS/hydra#715

jonringer (Contributor)

> Evaluation 1570647 of jobset nixos:nixos-test-expensive-eval
>
> This evaluation was performed on 2020-02-15 00:59:23. Fetching the dependencies took 3s and evaluation took 1109s.

Oof, ~20 minutes for an eval. That's rough.

vcunat (Member) commented Feb 15, 2020

That seems a quite normal number, IIRC (for our big jobsets like trunk-combined).

vcunat (Member) commented Feb 17, 2020

OK, let me ask explicitly about that miracle: how are channels going to work when we have no tested job anymore? Perhaps I just don't understand the intentions.

edolstra (Member)

The tested job is back (it was never gone, but it did have an evaluation error). We'll need to backport 2de3caf and 8950429 to the 20.03 branch.

vcunat (Member) commented Feb 17, 2020

Great ❤️ I pushed 20.03 backports.

I believe the issue is fixed and shouldn't re-appear anytime soon. Possible TODOs:

  • backport to 19.09. It will probably keep evaluating without it, but we could make it cheaper (for the several remaining months). It surely doesn't apply cleanly, but it should be a mechanical change.
  • still consider the approach from LnL7; perhaps we can get even better performance thanks to that.

nixos-discourse (bot)

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/nixos-20-03-beta/5935/1

nixos-discourse (bot)

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/firefox-not-up-to-date/5941/2

vcunat (Member) commented Feb 18, 2020

No good; even the small channels are blocked now: NixOS/hydra#715 (comment)

@vcunat vcunat reopened this Feb 18, 2020
vcunat (Member) commented Feb 20, 2020

Resolved, and today all channels even got updated.

@vcunat vcunat closed this as completed Feb 20, 2020
nixos-discourse (bot)

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/nixos-20-03-beta/5935/7
