
Re-enable ThreadSanitizer in the Inria CI #12644

Merged
4 commits merged into ocaml:trunk on Oct 17, 2023

Conversation

OlivierNicole
Contributor

Building the compiler with ThreadSanitizer and running the testsuite caused too many reports in OCaml 5, so this CI job was disabled (see #11040).

Since then, the work on TSan support for OCaml programs has led to fixing a number of those data races and to temporarily silencing the ones still awaiting investigation (see #11040 again). As a result, running the testsuite with --enable-tsan is now a cheap and effective way of detecting new data races introduced in the runtime.

A second good reason to restore the TSan CI is that it will detect early when a change has accidentally broken TSan instrumentation (as happened before, as a side effect of removing a symbol, #12383 (review)) or has introduced other issues (e.g. a new test revealed a TSan limitation with signals, #12561 (comment)).

Adding this test to the GitHub Actions CI would arguably lengthen the runs (a GHA run on amd64 Linux with TSan takes about 50 minutes). This PR therefore suggests the compromise of enabling it on the Inria CI, which runs on every merge.

@gasche
Member

gasche commented Oct 9, 2023

I agree that having some TSan testing in our CI is a good idea, so I support the broad move. I don't know what the best approach to enabling it is, and I would defer that question to our CI experts.

Note: if TSan is slow to run the testsuite, we could consider just building it and running a very basic test.

I would guess that the tests that see the largest slowdown are the multicore "burn" tests, which are already known to be hard to size appropriately; there may be improvements to make in this area, but that requires more work and should not block some form of TSan testing support. (Compiler testsuite curation is an area that we know needs work and where help is welcome.)

@dra27
Member

dra27 commented Oct 9, 2023

My working assumption is that there should not be many (if any) PRs which will break TSAN. I gather it's already the case that PRs which look like they might affect it are being manually tested with it! If we add it to the sanitizers job, and then find we have a PR a month which is actually breaking it, perhaps we could then look at moving the check to pull requests (any failures observed would also inform how useful in practice the idea of a reduced testsuite may be, for example).

@OlivierNicole
Contributor Author

Note: if TSan is slow to run the testsuite, we could consider just building it and running a very basic test.

This makes sense, but running the testsuite has the benefit of potentially triggering any new data races introduced in the runtime; at least, that's what we have observed so far.

@xavierleroy
Contributor

It makes sense to me to test TSAN on the Inria Jenkins CI, either as part of the "sanitizers" test like you do here, or (if it is too slow) as a separate test. The only caveat is that the machine running this test is currently stuck at Clang 13 because of a fairly old Ubuntu, but it should be upgraded to Ubuntu 22.04 soon and then we'll be able to use more recent Clang versions if desired.

@OlivierNicole
Contributor Author

TSan support should work with all versions of Clang starting from 11.

There seems to be agreement that running TSan in the Inria CI is worthwhile for now. I’m happy to move it to a separate test if we decide that ~50 min is too long.

@xavierleroy
Contributor

xavierleroy left a comment

OK to reinstall this CI test and see what happens. Minor suggestion below.

Comment on lines 121 to 123
./configure \
--enable-tsan
@xavierleroy
Contributor

What about leaving CC=clang-13? I would feel slightly more confident if we knew exactly which C compiler is being tested.

@OlivierNicole
Contributor Author

Done.

@xavierleroy
Contributor

The CI server is now running Ubuntu 22.04 LTS and has all versions of clang available up to 18. Maybe I'll bump the clang version used in this test from 13 to 14, as clang 14 is the default in Ubuntu 22.04. But clang 18 will come in very handy for testing preliminary C23 support!

@OlivierNicole
Contributor Author

Thank you for your comments. I did not include a Changes entry; I am not sure whether one is required.

@gasche
Member

gasche commented Oct 13, 2023

I will run a precheck on this PR to see if the CI actually passes.

@gasche
Member

gasche commented Oct 13, 2023

Precheck run(ning): https://ci.inria.fr/ocaml/job/precheck/905/

@gasche
Member

gasche commented Oct 13, 2023

... and I learned that I don't know how the INRIA CI works at all anymore, so I should probably leave this stuff to other people. (By clicking in various places I saw a CI failure which suggests that the current sanitizers configuration (unrelated to the present PR, I think) may need fixing.)

@xavierleroy
Contributor

xavierleroy commented Oct 15, 2023

I don't know how the INRIA CI works at all anymore

Handy guide: "precheck" = "main" on a user-provided repo. All other jobs ("sanitizers", "other-configs", etc.) apply only to trunk and release branches of the ocaml/ocaml repo.

I saw a CI failure which suggests that the current sanitizers configuration (unrelated to the present PR I think?) may need fixing

Only in 4.14. I pushed the fix.

@xavierleroy
Contributor

Running the script manually on the CI machine reports 5 failed tests:

    tests/tsan/'perform.ml' with 1.1 (native) 
    tests/tsan/'reperform.ml' with 1.1 (native) 
    tests/tsan/'unhandled.ml' with 1.1 (native) 
    tests/parallel/'catch_break.ml' with 1.1.2 (native) 
    tests/parallel/'catch_break.ml' with 1.1.1 (bytecode) 

catch_break was discussed elsewhere, I think. For perform.ml, the error trace differs from the expected one:

    #0 camlPerform.race_<implemspecific> <implemspecific> (<implemspecific>)
     #1 camlPerform.h_<implemspecific> <implemspecific> (<implemspecific>)
-    #2 camlPerform.g_<implemspecific> <implemspecific> (<implemspecific>)
-    #3 camlPerform.f_<implemspecific> <implemspecific> (<implemspecific>)
-    #4 caml_runstack <implemspecific> (<implemspecific>)
+    #2 caml_tsan_entry_on_resume <implemspecific> (<implemspecific>)
+    #3 caml_tsan_entry_on_resume <implemspecific> (<implemspecific>)
+    #4 caml_resume <implemspecific> (<implemspecific>)
     #5 camlPerform.fun_<implemspecific> <implemspecific> (<implemspecific>)
     #6 camlPerform.main_<implemspecific> <implemspecific> (<implemspecific>)
     #7 camlPerform.entry <implemspecific> (<implemspecific>)

Does this ring a bell?

@xavierleroy
Contributor

xavierleroy commented Oct 15, 2023

At any rate, we're not re-adding the test until it reports no errors. This means turning off the catch_break test if TSAN is enabled, and either solving the issues with the three tsan/ tests or disabling them until then.

@OlivierNicole
Contributor Author

I can reproduce the failures locally with Clang 13. They are no longer there with Clang 14.

@OlivierNicole
Contributor Author

I will investigate what causes this failure in Clang 13.

The clang 13 thread sanitizer produces different, less precise traces. Also, clang 14 is the default version in Ubuntu 22.04 LTS.
@xavierleroy
Contributor

Thanks for the investigations! I confirm that with clang-14 the test works fine on the CI machine. Merging!

@xavierleroy merged commit 4042ca3 into ocaml:trunk on Oct 17, 2023
8 of 9 checks passed
@OlivierNicole
Contributor Author

Thank you. Naive question: where do the results appear? Is there a notification mechanism when that CI fails?

@gasche
Member

gasche commented Oct 19, 2023

This should be documented in

https://github.com/ocaml/ocaml/blob/trunk/HACKING.adoc#inrias-continuous-integration-ci

(If some details are missing, hopefully the documentation can be improved.)

@xavierleroy
Contributor

We're getting intermittent-but-frequent failures on two tests:

tests/parallel/'domain_parallel_spawn_burn.ml' with 1 (native) 
tests/weak-ephe-final/'weaktest_par_load.ml' with 2 (native) 

In both cases TSAN reports a data race.

It should be possible to give you access to the full logs and the Jenkins CI system in general. Can you please email me and @Octachron about this?

@gasche
Member

gasche commented Oct 19, 2023

Note: #11040 is our meta-issue on current (and past) data races in the runtime. It would be interesting to eventually post these traces there. That requires some thought about whether a given trace is likely to be new or comes from an already-reported issue. We could leave this curation work to TSan experts like @OlivierNicole, but in the long term, ideally, we mere runtime maintainers would be comfortable enough to do it ourselves.

@xavierleroy
Contributor

The races could be in the tests themselves, not in the runtime system. An expert like @OlivierNicole will know :-)

@gasche
Member

gasche commented Oct 19, 2023

My experience is that these "burn" tests are good at finding races in the runtime (for the same reason that you dislike them: often they overcommit resources... and thus they are good at hitting pathological schedules), so I would assume races in the runtime.

@OlivierNicole
Contributor Author

I realized that I already had a Jenkins account from some work I did a while back. Looking into the logs now…

@OlivierNicole
Contributor Author

OlivierNicole commented Oct 19, 2023

The failure of weaktest_par_load is an instance of #12282. This spurious report is supposed to be silenced, but unfortunately clang disregards the no_tsan attribute when the function is inlined. I’m working on a PR.

The failure of domain_parallel_spawn_burn is a very conspicuous data race on running, which was not detected until now because it only appears when the testsuite is run with OCAML_TEST_SIZE ⩾ 2.

So one false positive in the runtime that refuses to be silenced, and a data race in a test.
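
To make this more concrete, here is a minimal, hypothetical OCaml sketch of the kind of test-level race described above: a plain mutable flag shared between domains, alongside the usual Atomic-based fix. The names (running, racy, fixed) and the structure are illustrative only; the actual code in domain_parallel_spawn_burn.ml may differ.

    (* Hypothetical sketch, not the actual test code: a plain mutable flag
       shared between two domains without synchronization is a data race;
       an Atomic.t is not. *)

    (* Racy variant: the main domain writes [running] while the spawned
       domain keeps reading it, with no synchronization in between. TSan
       reports this when the two accesses actually overlap at run time. *)
    let racy () =
      let running = ref true in
      let worker =
        Domain.spawn (fun () -> while !running do Domain.cpu_relax () done)
      in
      running := false;  (* unsynchronized write racing with the reads *)
      Domain.join worker

    (* Fixed variant: the flag is an Atomic.t, so the concurrent accesses
       are well-defined and TSan has nothing to report. *)
    let fixed () =
      let running = Atomic.make true in
      let worker =
        Domain.spawn (fun () ->
          while Atomic.get running do Domain.cpu_relax () done)
      in
      Atomic.set running false;
      Domain.join worker

    let () = racy (); fixed ()

Compiled with a TSan-enabled compiler (a switch built with --enable-tsan), the racy variant would be expected to produce a report naming the conflicting read and write, which is exactly the kind of output this CI job surfaces.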

@OlivierNicole
Contributor Author

Does the Jenkins CI not show stderr? Sometimes the logs do not include TSan reports although there should be one, e.g. the recent https://ci.inria.fr/ocaml/job/sanitizers/2025/execution/node/16/log/?consoleFull
