Re-enable ThreadSanitizer in the Inria CI #12644
Conversation
I agree that having some TSan testing in our CI is a good idea, so I support the broad move. I don't know what the best approach to enabling it is, and I would defer the question to our CI experts. Note: if TSan is slow to run the testsuite, we could consider just building it and running a very basic test. I would guess that the tests that see the largest slowdown are the multicore "burn" tests, which are already known to be hard to size appropriately; there may be improvements to make in this area, but this requires more work and should not be a blocker for some sort of testing support. (Compiler testsuite curation is an area that we know needs work and where help is welcome.)
My working assumption is that there should not be many (if any) PRs which will break TSan. I gather it's already the case that PRs which look like they might affect it are being manually tested with it! If we add it to the sanitizers job, and then find we have a PR a month which is actually breaking it, perhaps we could then look at moving the check to pull requests (any failures observed would also inform how useful in practice the idea of a reduced testsuite may be, for example).
This makes sense, but running the testsuite has the positive effect of possibly triggering new data races that would be introduced in the runtime—at least, that's what we have observed up to now.
It makes sense to me to test TSan on the Inria Jenkins CI, either as part of the "sanitizers" test like you do here, or (if it is too slow) as a separate test. The only caveat is that the machine running this test is currently stuck at Clang 13 because of a fairly old Ubuntu, but it should be upgraded to Ubuntu 22.04 soon and then we'll be able to use more recent Clang versions if desired.
TSan support should work with all versions of Clang starting from 11. There seems to be an agreement that running TSan in the Inria CI is relevant for now. I'm happy to move it to a separate test if we decide that ~50 min is too long.
OK to reinstall this CI test and see what happens. Minor suggestion below.
tools/ci/inria/sanitizers/script (outdated):

./configure \
  --enable-tsan
What about leaving CC=clang-13? I feel slightly more confident if we know exactly which C compiler is being tested.
Done.
The CI server is now running Ubuntu 22.04 LTS and has all versions of clang available up to 18. Maybe I'll bump the clang version used in this test from 13 to 14, as clang 14 is the default in Ubuntu 22.04. But clang 18 will come in very handy to test C23 preliminary support!
Thank you for your comments. I did not include a Changes entry; I'm not sure if one is required.
I will run a precheck.
Precheck run(ning): https://ci.inria.fr/ocaml/job/precheck/905/
... and I learned that I don't know how the Inria CI works at all anymore, so I should probably leave this stuff to other people. (By clicking in various places I saw a CI failure which suggests that the current …)
Handy guide: "precheck" = "main" on a user-provided repo. All other jobs ("sanitizers", "other-configs", etc) apply only to trunk and release branches of the ocaml/ocaml repo.
Only in 4.14. I pushed the fix.
Running the script manually on the CI machine reports 5 failed tests:
catch_break was discussed elsewhere, I think. For perform.ml, the error trace is different:
Does this ring a bell?
At any rate, we're not re-adding the test until it reports no errors. This means turning off the catch_break test if TSan is enabled, and either solving the issues with the three tsan/ tests or disabling them until then.
I can reproduce the failures locally with Clang 13. They are no longer there with Clang 14. I will investigate what causes these failures in Clang 13.
Clang 13's ThreadSanitizer produces different, less precise traces. Also, Clang 14 is the default version in Ubuntu 22.04 LTS.
Thanks for the investigations! I confirm that with clang-14 the test works fine on the CI machine. Merging!
Thank you. Naive question: where do the results appear? Is there a notification mechanism when that CI fails?
This should be documented in https://github.com/ocaml/ocaml/blob/trunk/HACKING.adoc#inrias-continuous-integration-ci (If some details are missing, hopefully the documentation can be improved.)
We're getting intermittent-but-frequent failures on two tests:
In both cases TSan reports a data race. It should be possible to give you access to the full logs and the Jenkins CI system in general. Can you please email me and @Octachron about this?
Note: #11040 is our meta-issue on current (and past) data races in the runtime. It would be interesting to eventually post these traces there. It requires some thinking about whether the trace is likely to be new, or come from a known issue already reported. We could leave this curation work to TSan experts like @OlivierNicole, but long-term ideally we mere runtime maintainers would be comfortable enough to do it ourselves.
The races could be in the tests themselves, not in the runtime system. An expert like @OlivierNicole will know :-)
My experience is that these "burn" tests are good at finding races in the runtime (for the same reason that you dislike them: often they overcommit resources... and thus they are good at hitting pathological schedules), so I would assume races in the runtime. |
I realized that I already had a Jenkins account from some work I did a while back. Looking into the logs now…
The failure of … The failure of … So: one false positive in the runtime that refuses to be silenced, and a data race in a test.
Does the Jenkins CI not show stderr? Sometimes the logs do not include TSan reports although there should be one, e.g. the recent https://ci.inria.fr/ocaml/job/sanitizers/2025/execution/node/16/log/?consoleFull
Building the compiler with ThreadSanitizer and running the testsuite caused too many reports in OCaml 5 and was disabled (see #11040).

Since then, the work on TSan support for OCaml programs has led to fixing a number of those data races and temporarily silencing the ones that are waiting to be investigated (see #11040 again). As a result, running the testsuite with --enable-tsan is now a cheap and effective way of detecting new data races that may be introduced in the runtime.

A second good reason to restore the TSan CI is that it will detect early if a recent change has accidentally broken TSan instrumentation (as has happened before as an accidental consequence of removing a symbol, #12383 (review)), or other issues (e.g. a new test revealed a TSan limitation with signals, #12561 (comment)).

Adding this test to the GitHub Actions CI arguably lengthens the runs (a GHA run on amd64 Linux with TSan takes about 50 minutes). This PR therefore suggests the compromise of enabling it on the Inria CI, which is run on every merge.