GitLab CI: hack to deal with GHC heisenbug #2443

Closed
wants to merge 9 commits into from

@DigitalBrains1 DigitalBrains1 commented Mar 24, 2023

Every now and then, GHC will exit with the error

out: mmap 131072 bytes at (nil): Cannot allocate memory
out: Try specifying an address with +RTS -xm<addr> -RTS
out: internal error: m32_allocator_init: Failed to map
    (GHC version 9.0.2 for x86_64_unknown_linux)
    Please report this as a GHC bug:  https://www.haskell.org/ghc/reportabug

(when the binary is named out). For some reason this problem has become more pronounced for us. Since we invoke GHC/Clash a very large number of times in some of our CI tests, the chances of hitting it in at least one of those invocations are really high. Additionally, some binaries seem to have really high odds of exhibiting the issue.

This commit wraps the ghc, ghci, clash and clashi binaries in a Bash script that will retry for a total of twenty(!) times when this error message is observed. The number of retries can be configured with the "-t" option argument.
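
The retry behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the script from this PR: the function name `retry_on_heisenbug` and the variable `HEISENBUG_PATTERN` are made up for the example, the attempt limit is passed as the first argument instead of through a `-t` option, and it is an assumption that matching on the `m32_allocator_init` line of stderr is enough to recognise the bug.

```shell
#!/bin/sh
# Hypothetical retry sketch; not the wrapper script added by this PR.

# Fragment of the error message that identifies the heisenbug.
HEISENBUG_PATTERN='m32_allocator_init: Failed to map'

# retry_on_heisenbug MAX_TRIES CMD [ARGS...]
# Runs CMD at most MAX_TRIES times in total. It retries only when CMD fails
# and its stderr contains the heisenbug message.
# The body runs in a subshell, so its variables do not leak to the caller.
retry_on_heisenbug() (
  max_tries=$1
  shift
  errfile=$(mktemp) || exit 1
  attempt=1
  status=1
  while [ "$attempt" -le "$max_tries" ]; do
    status=0
    # Capture stderr so it can be inspected; stdout passes straight through.
    "$@" 2>"$errfile" || status=$?
    # Replay the captured stderr so the caller still sees it.
    cat "$errfile" >&2
    # Stop on success, or on any failure that is not the heisenbug.
    if [ "$status" -eq 0 ] || ! grep -qF "$HEISENBUG_PATTERN" "$errfile"; then
      break
    fi
    echo "heisenbug observed on attempt $attempt of $max_tries" >&2
    attempt=$((attempt + 1))
  done
  rm -f "$errfile"
  exit "$status"
)
```

A call would look like `retry_on_heisenbug 20 ghc Main.hs -o out`. Note that stdout of a failed attempt has already been passed through by the time the retry happens, and stderr is only replayed after each attempt finishes, so this sketch does not preserve output exactly.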

However, the test suite also compiles Haskell code to a binary and then runs that binary. These binaries have the same issues, but they don't come from the PATH, so we can't intercept them like we can for things that are on the PATH. For this, we introduce a new Tasty test provider that also tries up to twenty times when the heisenbug's error message is observed.
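
To illustrate why only tools found through the PATH can be intercepted, here is a small sketch of PATH shadowing. The helper `install_wrappers` and the directory layout are hypothetical and not taken from this PR; the generated wrappers simply forward to the real binary, at the point where retry logic would go.

```shell
#!/bin/sh
# Hypothetical PATH-shadowing sketch; names and layout are assumptions.

# install_wrappers WRAPPER_DIR REAL_DIR TOOL...
# Writes one small wrapper script per TOOL into WRAPPER_DIR. Each wrapper
# calls the real tool in REAL_DIR by absolute path.
# The body runs in a subshell, so its variables do not leak to the caller.
install_wrappers() (
  wrapper_dir=$1
  real_dir=$2
  shift 2
  mkdir -p "$wrapper_dir"
  for tool in "$@"; do
    cat > "$wrapper_dir/$tool" <<EOF
#!/bin/sh
# Generated wrapper: retry logic would go here; this sketch only forwards.
exec "$real_dir/$tool" "\$@"
EOF
    chmod +x "$wrapper_dir/$tool"
  done
)
```

After `install_wrappers "$PWD/wrappers" /path/to/real/bin ghc ghci clash clashi` and `PATH="$PWD/wrappers:$PATH"`, a plain `ghc` resolves to the wrapper. A freshly compiled test binary started by its path, such as `./out`, never goes through the PATH lookup, which is why the test suite needs its own retry mechanism.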

We need both solutions because we are also seeing the problem in doctests, which don't involve our Tasty test providers, so these need to be covered by the script approach. Any clash invocations from Tasty are not retried by the test provider, since the Bash script already does that.

We think this problem occurs on every combination of GHC version and Linux kernel version, but we are seeing it (almost?) exclusively on GHC 9.0.2.

@christiaanb

Have we tried to actually do the thing in the message? E.g. run the failing executables with +RTS -xm20000000 -RTS: https://downloads.haskell.org/ghc/latest/docs/users_guide/runtime_control.html#rts-flag--xm%20⟨address⟩

@DigitalBrains1 DigitalBrains1 force-pushed the retry-heisenbug branch 2 times, most recently from d4338a9 to f1a49cf Compare March 25, 2023 11:54
@DigitalBrains1 DigitalBrains1 marked this pull request as draft March 25, 2023 11:54
@DigitalBrains1 DigitalBrains1 force-pushed the retry-heisenbug branch 8 times, most recently from d9c03f4 to 72368be Compare March 25, 2023 15:30
@DigitalBrains1 DigitalBrains1 marked this pull request as ready for review March 25, 2023 15:36

@martijnbastiaan martijnbastiaan left a comment

I'm not sure we did @christiaanb, so we could try it once and see if that fixes things (I kinda doubt it though). Otherwise LGTM.

@DigitalBrains1

Interestingly, compiling a binary with -rtsopts and then invoking the binary with +RTS -xm20000000 -RTS does seem to fix the problem. At least, I don't see it anymore where the problem was very pronounced before. Note that since it is actually clash that might fail, we'd need to compile Clash itself with -rtsopts. But that might be the better fix.

It is interesting how the doc seems to suggest that this problem is not about compiled Haskell binaries at all:

❗Warning
[...] Do not use unless GHCi fails with a message like [...]

GHCi has nothing to do with the problem we see here whatsoever.

@DigitalBrains1

DigitalBrains1 commented Mar 26, 2023

And indeed, instead of compiling with -rtsopts and then passing RTS options at runtime, it is also possible to simply compile with -with-rtsopts=-xm20000000, as you would expect.
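
The two approaches discussed in this thread can be put side by side as small shell helpers. The helper names, the `GHC` override and the `XM_ADDR` variable are made up for this sketch; only the flags `-rtsopts`, `-with-rtsopts` and `+RTS -xm<addr> -RTS` come from the discussion, and the address is the one quoted above, not a tuned value.

```shell
#!/bin/sh
# Hypothetical helpers comparing the two workarounds from this thread.

# Compiler to call; defaults to the ghc found on the PATH.
GHC=${GHC:-ghc}
# Address quoted in this thread; an example value only.
XM_ADDR=20000000

# Approach 1: link with -rtsopts so the binary accepts RTS flags at run time,
# then pass -xm on every invocation.
build_rtsopts() { "$GHC" -rtsopts "$@"; }
run_with_xm() (
  prog=$1
  shift
  "$prog" +RTS "-xm$XM_ADDR" -RTS "$@"
)

# Approach 2: bake the RTS option into the binary at link time, so callers
# do not need to pass anything.
build_baked() { "$GHC" "-with-rtsopts=-xm$XM_ADDR" "$@"; }
```

With these, `build_rtsopts Main.hs -o out && run_with_xm ./out` corresponds to the first approach and `build_baked Main.hs -o out && ./out` to the second. The second is the more convenient one when the affected binary is started by other tools that cannot easily be told to add RTS flags.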

In very specific tests in GitLab CI we are affected by GHC bug #19421.
We can work around the issue by passing `-with-rtsopts=-xm20000000` when
compiling an affected binary. This is a stopgap measure until the real
bug is fixed.

We have seen the bug:
- In `clash-testsuite` in `clashLibTest`s
- In `ffi:example` in the `clash` binary itself
- In `prelude:doctests`, probably in the `doctests` binary itself,
  although this is not certain.

This workaround was applied only to those cases that were observed to go
wrong, although as a consequence now the `clash` binary is always built
with the RTS option.
@DigitalBrains1

Superseded by #2444

@DigitalBrains1 DigitalBrains1 deleted the retry-heisenbug branch March 29, 2023 18:28