-
Notifications
You must be signed in to change notification settings - Fork 78
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
testscript: fix "signal: killed" exec errors on MacOS by adding a 200ms sleep on darwin #219
Conversation
Thanks for digging into this! It's great to have someone with a machine that is able to reproduce do this investigation, as I was unable to.
I'm curious, what would make them not finish in time? We wait for the How did you arrive at the 50ms? Is that number long enough to make the failures virtually impossible on your laptop? I'm almost tempted to double or even quadruple that number if so, because CI runners are often many times slower than a modern laptop. I'd also like to have this sleep behind a Also, interesting that CI still failed with the same error on Mac :) Finally - it would be good to still figure out a way to make this work well without sleeping. Sleeping is always going to be either too short and flakey, or too long and wasteful. If there really isn't anything we can do but sleep, perhaps we could do something a bit better, such as sleeping in a loop until calling the last of the commands as a no-op (e.g. with |
Also note that the failure in CI above is from executing |
That's a rather arbitrary number. I did test higher, but not lower numbers. My test case is pretty aggressive and I can easily run that may times without any error -- but this is the setup part of the scripts, so there's no reason not to go for a longer sleep.
Sure, but there's a lot of of file copying and mkdiring in
I have force pushed (bad habit ...) a version with 200ms sleep + a runtime.GOOS == "darwin" conditional. Let's see if the test passes (the failing test must be unrelated). |
7beb349
to
5bad37a
Compare
That's awesome if it works, can you try File.Sync instead of the sleep (if possible) to see if that solves it as well? |
I'll see if I can find some time during the day. So, it fixes my problem, so in that sense it works. I run tests on save in my editor, and before this patch I get this error at about 1/4 of the test runs (depending on the size of the project under test), and have to rerun the test, which breaks my workflow pretty bad. Whether this fixes the CI problem is yet to be proven, but the symptoms are the same, so I would be surprised if it didn't. |
OK, I tried a variant to add a https://github.com/rogpeppe/go-internal/blob/master/testscript/exe.go#L126
|
5bad37a
to
7ff90d9
Compare
31202af
to
f6df25d
Compare
Interesting, I wouldn't have guessed that the hard links were to blame. Note that they're hard links though, not symbolic links. I can't quite remember why I chose
I would actually say the opposite. Copying a binary will generally take much, much less than 50ms or 200ms. A hard link is typically even faster, yes, but here for MacOS we're aiming for avoiding the failures over speed. If disabling hard links on Mac fixes the issue, like we already do on Windows, then I would say that's definitely a better fix than an arbitrary sleep. |
On `MacOS` there are lots of reports of unexpected failing tests with output similar to this: ``` [signal: killed] FAIL: testscripts/myecho.txt:1: unexpected command failure ``` On CI builds the workaround has been to downgrade to a builder with MacOS <= 11 (e.g. macos-11 on GitHub). In development on a MacBook, this is not an option. This commit works around what seem to be a upstream bug in `os.Link` by adding a small sleep before `TestingM.Run()` to allow the write of the test command hard links to be ready. See rogpeppe#200
f6df25d
to
6f1ce9e
Compare
OK, my mistake.
You're duplicating the binary n times (where n is the number of commands in the map, which can be many). If you have many commands/big binaries, I doubt you can beat 200ms, but it doesn't matter much. I have created this PR: Which also fixes this issue for me. Pick the one you like. |
Both have passed CI now, so they both seem to help. I'm inclined to go with the other PR, just because sleeps are always going to be flakey or slow. I'll leave both PRs up for 24h to see if other Mac users chime in. |
@mvdan note that you should also bump the https://github.com/rogpeppe/go-internal/blob/master/.github/workflows/test.yml#L19 But I'm pretty sure that's not picked up from a PR from a fork, so you should do that as a follow up. As I said, I'm 100% sure that this fixes my problem, but I'm not 100% sure that it also fixes all CI related problems. |
@mvdan I had a second look on this, and I think I want to put in a vote for the "sleep fix". I checked the bin folders for a test I have running: This patch:
The other patch:
The way the binaries are created are very clever, but it means lots of potentially very large files created on each run if your not able to hard link it. |
Yeah, more than happy to bump the macos version back to latest once we're reasonably sure that the flakes are solved.
That's true, although these are temporary files, so does the size matter that much? Speaking for myself, a 200ms sleep on every test run would bother me more than using another 50-100MiB worth of I'd of course love to find a solution to continue using hard links. Sleeping doesn't seem like a good solution :) If we knew what to poll for, or somehow tell when the binaries are "ready" to be executed, then at least we could incrementally sleep until done. |
You understand that my numbers above are 50MiB vs 500MiB, right? I have tests that 1. I run on save in my editor and 2. Typically takes a second or two to finish (so I don't care too much about 200ms extra). For arguments sake, let's say I have a bigger project with more commands so it becomes 1GB on every save in my editor, that's a lot of SSD writes during a year ... Also, knowing this is a issue with
|
@bep You've probably tried this already, but given that you can reproduce this problem reasonably reliably, it might be interesting to try stat-ing and reading the contents of the link just after creating it. I'm specifically wondering if it can ever happen that that succeeds when the later exec fails. If not, then this could be used as the basis for a polling loop, assuming the availability is a global property of the linked file. Another more direct way to implement a polling loop is actually to try to exec the commands immediately after creating the links. We could set some environment variable to make the action of the command a no-op in this case. (For the record, I'm not really in favour of either the sleep or the copy-the-binary solution, unless those really are the only way forward). |
Would symlinks instead of hardlinks work better/out of the box? I vaguely remember issues with hardlinks turning to copies when crossing fs and maybe tmpdir is different fs? |
@ldemailly see #221. |
Thanks for checking those. If symlinks don't have this race problem, then hopefully that's the easiest, as we don't need any sleeping nor polling. |
On
MacOS
there are lots of reports of unexpected failing tests with output similar to this:On CI builds the workaround has been to downgrade to a builder with MacOS <= 11 (e.g. macos-11 on GitHub).
In development on a MacBook, this is not an option.
This commit works around what seem to be a upstream bug in
os.Link
by adding a small sleep beforeTestingM.Run()
to allow the write of the testcommand symlinks to be ready.
See #200
Fixes #200