
Crashes and locks under Windows MingW #12230

Closed
jmid opened this issue May 8, 2023 · 5 comments

@jmid
Contributor

jmid commented May 8, 2023

Since 5.0 we have observed that a combination of Domains and Threads can cause either segfaults or dead/live-locks in the MingW Windows port. We have observed the issue when testing both backends (native code and bytecode) but it seems easier to trigger in bytecode mode. We suspect that both kinds of failures may be caused by the same underlying problem.

The test itself generates a combination of Domains and Threads as a dependency tree, encoded as a record of arrays.
For the generation-part, there's a QCheck dependency (for now).

To recreate:

The last line of the above simply repeats a bytecode version of the test until it fails:

$ while  _build/default/src/threadomain/threadomain.bc -v -s 377546401; do :; done
random seed: 377546401
generated error fail pass / total     time test name
[✓]    3    0    0    3 /    3     6.0s Mash up of threads and domains
================================================================================
success (ran 1 tests)

[...]

random seed: 377546401
generated error fail pass / total     time test name
[ ]    2    0    0    2 /    3     2.3s Mash up of threads and domainsSegmentation fault

A live- (or dead-)lock shows up as the run making no further progress (with no QCheck callbacks executed to update the test status) after 2 seconds or so:

random seed: 377546401
generated error fail pass / total     time test name
[✓]    3    0    0    3 /    3     6.2s Mash up of threads and domains
================================================================================
success (ran 1 tests)
random seed: 377546401
generated error fail pass / total     time test name
[ ]    2    0    0    2 /    3     2.3s Mash up of threads and domains

For a while we have observed these timeouts and crashes occasionally on this test in our CI, but have struggled to cook up reproduction steps: ocaml-multicore/multicoretests#203

To get a sense of the behaviour, here's a summary of 5 runs:

  • segfault on iteration 1
  • dead/live-lock on iteration 6
  • segfault on iteration 18
  • dead/live-lock on iteration 5
  • segfault on iteration 18
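The per-run bookkeeping above (which iteration failed, and whether it was a crash or a hang) could be automated along these lines. This is only a sketch, not the script used for the runs reported here: the function names `classify_status` and `repeat_until_failure` are made up for illustration, it assumes GNU `timeout` from coreutils is on the PATH, and it assumes the shell reports a segfault as status 139 (128 + SIGSEGV), which may differ between MSYS2/Cygwin setups.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: rerun a command until it fails and report how it failed.

# classify_status STATUS
# Map an exit status to a short label.
classify_status() {
  case "$1" in
    0)   echo "ok" ;;
    124) echo "hang" ;;       # GNU timeout exits with 124 when the limit is hit
    139) echo "segfault" ;;   # 128 + 11 (SIGSEGV); an assumption about the shell
    *)   echo "exit $1" ;;
  esac
}

# repeat_until_failure LIMIT_SECONDS COMMAND [ARGS...]
# Run COMMAND under a time limit until it fails, then print
# "<kind> on iteration <n>" and return the failing status.
repeat_until_failure() {
  local limit=$1; shift
  local i=0 status
  while :; do
    i=$((i + 1))
    timeout "$limit" "$@" >/dev/null 2>&1
    status=$?
    if [ "$status" -ne 0 ]; then
      echo "$(classify_status "$status") on iteration $i"
      return "$status"
    fi
  done
}
```

For example, `repeat_until_failure 60 _build/default/src/threadomain/threadomain.bc -v -s 377546401` would treat any run longer than 60 seconds as a hang. The 60-second limit is an arbitrary choice, picked to sit well above the roughly 6 seconds a passing run takes.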

Above I use the seed 377546401, which triggers the failures on my machine/setup.
I initially found this particular seed by running the same loop with random seeds:

  while  _build/default/src/threadomain/threadomain.bc -v ; do :; done

Eventually this crashed on the 22nd iteration with random seed: 377546401, which is why I now pass -s 377546401.
To recreate, others may have more luck following the same process rather than simply reusing the same seed.
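The seed-hunting process can be sketched as a small script that reruns the test with fresh random seeds and pulls the seed out of the first failing run. This is an illustration only: `extract_seed` and `find_failing_seed` are hypothetical names, and the parsing assumes the `random seed: N` line format shown in the output above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: find a seed that makes the test fail.

# extract_seed
# Read test output on stdin and print the last "random seed: N" value.
# The trailing .* tolerates a carriage return from CRLF line endings.
extract_seed() {
  sed -n 's/^random seed: \([0-9][0-9]*\).*/\1/p' | tail -n 1
}

# find_failing_seed COMMAND [ARGS...]
# Rerun COMMAND (which picks a random seed each time) until it exits
# with a non-zero status, then print the seed of that failing run.
find_failing_seed() {
  local log
  log=$(mktemp) || return 1
  while "$@" >"$log" 2>&1; do :; done
  extract_seed <"$log"
  rm -f "$log"
}
```

Something like `seed=$(find_failing_seed _build/default/src/threadomain/threadomain.bc -v)` followed by rerunning with `-s "$seed"` mirrors the steps described above. Note that this loop only stops on a crash or other non-zero exit; a dead/live-lock would leave it stuck unless the command is also given a time limit.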

Credit to @shym for having written this nice torture instrument 😄

@Octachron Octachron added the bug label Jun 28, 2023
@gasche
Member

gasche commented Jun 28, 2023

Who wants to do a deep dive into the Threads implementation on Windows? @nojb, @MisterDA, any volunteers?

@nojb
Contributor

nojb commented Jun 28, 2023

Who wants to do a deep dive into the Threads implementation on Windows? @nojb, @MisterDA, any volunteers?

Unfortunately, I am flooded with other tasks at the moment, so I won't be able to get to do this anytime soon. Also, I don't know anything about threads :)

@jmid
Contributor Author

jmid commented Jul 5, 2023

FWIW, I came across a related MingW bug report that ticks some of the right keywords.

I have yet to understand the new runtime well enough to judge whether this may be the underlying cause or a wild goose chase... 😬 Consider yourself warned! ⚠️

@jmid
Contributor Author

jmid commented Mar 11, 2024

I believe this is fixed by #12882.

In more detail:

  • This is surprisingly tricky to reproduce. On a physical Windows laptop I struggled to do so - even for older OCaml versions known to have the defect. It is possible to trigger on CI though, e.g., here for 5.2.0~alpha1 (predating 12882). I've also been able to reproduce it relatively reliably on Windows under VirtualBox (which I used in the above instructions).
  • I've built a pre-12882 compiler from 1628c38 - Changes bookeeping (under VirtualBox) and confirmed that the above would crash/hang on it.
  • I've then built a post-12882 compiler from 572aeb5 - Merge pull request #12882 from gadmm/systhread_yield (also under VirtualBox) and been unable to either crash or hang it with the above instructions, despite extensive repetition (+45min).

As the test uses a random combination of systhreads and domains performing both computation and allocation, (in retrospect) it makes sense that unsafe systhread yielding could be causing it.

Let me know if you want me to perform more experiments in connection with this.

@jmid jmid closed this as completed Mar 11, 2024
@gasche
Member

gasche commented Mar 11, 2024

This is good news, thanks for the investigation!
