
Re-enable ThreadSanitizer in the Inria CI #12644

Merged
4 commits merged into ocaml:trunk on Oct 17, 2023

Conversation

OlivierNicole
Contributor

Building the compiler with ThreadSanitizer and running the testsuite caused too many reports in OCaml 5, so this CI job was disabled (see #11040).

Since then, the work on TSan support for OCaml programs has led to fixing a number of those data races and to temporarily silencing the ones still awaiting investigation (see #11040 again). As a result, running the testsuite with --enable-tsan is now a cheap and effective way of detecting new data races introduced in the runtime.

A second good reason to restore the TSan CI is that it will detect early when a change has accidentally broken TSan instrumentation (as happened before, as a side effect of removing a symbol, #12383 (review)) or has introduced other issues (e.g. a new test revealed a TSan limitation with signals, #12561 (comment)).

Adding this test to the GitHub Actions CI would arguably lengthen the runs (a GHA run on amd64 Linux with TSan takes about 50 minutes). This PR therefore suggests the compromise of enabling it on the Inria CI, which runs on every merge.

@gasche
Member

gasche commented Oct 9, 2023

I agree that having some TSan testing in our CI is a good idea, so I support the broad move. I don't know what the best approach to enabling it is, and I would defer that question to our CI experts.

Note: if TSan is slow to run the testsuite, we could consider just building it and running a very basic test.

I would guess that the tests that see the largest slowdown are the multicore "burn" tests, which are already known to be hard to size appropriately; there may be improvements to make in this area, but that requires more work and should not block some form of TSan testing support. (Compiler testsuite curation is an area that we know needs work and where help is welcome.)

@dra27
Member

dra27 commented Oct 9, 2023

My working assumption is that there should not be many (if any) PRs which will break TSAN. I gather it's already the case that PRs which look like they might affect it are being manually tested with it! If we add it to the sanitizers job, and then find we have a PR a month which is actually breaking it, perhaps we could then look at moving the check to pull requests (any failures observed would also inform how useful in practice the idea of a reduced testsuite may be, for example).

@OlivierNicole
Contributor Author

Note: if TSan is slow to run the testsuite, we could consider just building it and running a very basic test.

This makes sense, but running the testsuite has the benefit of potentially triggering any new data races introduced in the runtime; at least, that's what we have observed so far.

@xavierleroy
Contributor

It makes sense to me to test TSAN on the Inria Jenkins CI, either as part of the "sanitizers" test like you do here, or (if it is too slow) as a separate test. The only caveat is that the machine running this test is currently stuck at Clang 13 because of a fairly old Ubuntu, but it should be upgraded to Ubuntu 22.04 soon and then we'll be able to use more recent Clang versions if desired.

@OlivierNicole
Contributor Author

TSan support should work with all versions of Clang starting from 11.

There seems to be agreement that running TSan in the Inria CI is worthwhile for now. I’m happy to move it to a separate test if we decide that ~50 min is too long.

@xavierleroy
Contributor

xavierleroy left a comment

OK to reinstall this CI test and see what happens. Minor suggestion below.

Comment on lines 121 to 123
./configure \
--enable-tsan
@xavierleroy
Contributor

What about leaving CC=clang-13? I would feel slightly more confident if we knew exactly which C compiler is being tested.

@OlivierNicole
Contributor Author

Done.

@xavierleroy
Contributor

The CI server is now running Ubuntu 22.04 LTS and has all versions of clang available up to 18. Maybe I'll bump the clang version used in this test from 13 to 14, as clang 14 is the default in Ubuntu 22.04. But clang 18 will come in very handy for testing preliminary C23 support!

@OlivierNicole
Contributor Author

Thank you for your comments. I did not include a Changes entry; I am not sure whether one is required.

@gasche
Member

gasche commented Oct 13, 2023

I will run a precheck on this PR to see if the CI actually passes.

@gasche
Member

gasche commented Oct 13, 2023

Precheck run(ning): https://ci.inria.fr/ocaml/job/precheck/905/

@gasche
Member

gasche commented Oct 13, 2023

... and I learned that I don't know how the INRIA CI works at all anymore, so I should probably leave this stuff to other people. (By clicking in various places I saw a CI failure which suggests that the current sanitizers configuration (unrelated to the present PR, I think) may need fixing.)

@xavierleroy
Contributor

xavierleroy commented Oct 15, 2023

I don't know how the INRIA CI works at all anymore

Handy guide: "precheck" = "main" on a user-provided repo. All other jobs ("sanitizers", "other-configs", etc.) apply only to trunk and release branches of the ocaml/ocaml repo.

I saw a CI failure which suggests that the current sanitizers configuration (unrelated to the present PR I think?) may need fixing

Only in 4.14. I pushed the fix.

@xavierleroy
Contributor

Running the script manually on the CI machine reports 5 failed tests:

    tests/tsan/'perform.ml' with 1.1 (native) 
    tests/tsan/'reperform.ml' with 1.1 (native) 
    tests/tsan/'unhandled.ml' with 1.1 (native) 
    tests/parallel/'catch_break.ml' with 1.1.2 (native) 
    tests/parallel/'catch_break.ml' with 1.1.1 (bytecode) 

catch_break was discussed elsewhere, I think. For perform.ml, the error trace differs from the expected one:

    #0 camlPerform.race_<implemspecific> <implemspecific> (<implemspecific>)
     #1 camlPerform.h_<implemspecific> <implemspecific> (<implemspecific>)
-    #2 camlPerform.g_<implemspecific> <implemspecific> (<implemspecific>)
-    #3 camlPerform.f_<implemspecific> <implemspecific> (<implemspecific>)
-    #4 caml_runstack <implemspecific> (<implemspecific>)
+    #2 caml_tsan_entry_on_resume <implemspecific> (<implemspecific>)
+    #3 caml_tsan_entry_on_resume <implemspecific> (<implemspecific>)
+    #4 caml_resume <implemspecific> (<implemspecific>)
     #5 camlPerform.fun_<implemspecific> <implemspecific> (<implemspecific>)
     #6 camlPerform.main_<implemspecific> <implemspecific> (<implemspecific>)
     #7 camlPerform.entry <implemspecific> (<implemspecific>)

Does this ring a bell?

@xavierleroy
Contributor

xavierleroy commented Oct 15, 2023

At any rate, we're not re-adding the test until it reports no errors. This means turning off the catch_break test if TSAN is enabled, and either solving the issues with the three tsan/ tests or disabling them until then.

@OlivierNicole
Contributor Author

I can reproduce the failures locally with Clang 13. They are no longer there with Clang 14.

@OlivierNicole
Contributor Author

I will investigate what causes this failure in Clang 13.

The clang 13 thread sanitizer produces different, less precise traces. Also, clang 14 is the default version in Ubuntu 22.04 LTS.
@xavierleroy
Contributor

Thanks for the investigations! I confirm that with clang-14 the test works fine on the CI machine. Merging!

@xavierleroy merged commit 4042ca3 into ocaml:trunk on Oct 17, 2023
8 of 9 checks passed
@OlivierNicole
Contributor Author

Thank you. Naive question: where do the results appear? Is there a notification mechanism when that CI fails?

@gasche
Member

gasche commented Oct 19, 2023

This should be documented in

https://github.com/ocaml/ocaml/blob/trunk/HACKING.adoc#inrias-continuous-integration-ci

(If some details are missing, hopefully the documentation can be improved.)

@xavierleroy
Contributor

We're getting intermittent-but-frequent failures on two tests:

tests/parallel/'domain_parallel_spawn_burn.ml' with 1 (native) 
tests/weak-ephe-final/'weaktest_par_load.ml' with 2 (native) 

In both cases TSAN reports a data race.

It should be possible to give you access to the full logs and the Jenkins CI system in general. Can you please email me and @Octachron about this?

@gasche
Member

gasche commented Oct 19, 2023

Note: #11040 is our meta-issue on current (and past) data races in the runtime. It would be interesting to eventually post these traces there. That requires some thought about whether a given trace is likely to be new or comes from an already-reported issue. We could leave this curation work to TSan experts like @OlivierNicole, but in the long term, ideally, we mere runtime maintainers would be comfortable enough to do it ourselves.

@xavierleroy
Contributor

The races could be in the tests themselves, not in the runtime system. An expert like @OlivierNicole will know :-)

@gasche
Member

gasche commented Oct 19, 2023

My experience is that these "burn" tests are good at finding races in the runtime (for the same reason that you dislike them: often they overcommit resources... and thus they are good at hitting pathological schedules), so I would assume races in the runtime.

@OlivierNicole
Contributor Author

I realized that I already had a Jenkins account from some work I did a while back. Looking into the logs now…

@OlivierNicole
Contributor Author

OlivierNicole commented Oct 19, 2023

The failure of weaktest_par_load is an instance of #12282. This spurious report is supposed to be silenced, but unfortunately clang disregards the no_tsan attribute when the function is inlined. I’m working on a PR.

The failure of domain_parallel_spawn_burn is a very conspicuous data race on running, which was not detected until now because it only appears when the testsuite is run with OCAML_TEST_SIZE ⩾ 2.

So one false positive in the runtime that refuses to be silenced, and a data race in a test.
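
To make this more concrete, here is a minimal, hypothetical OCaml sketch of the kind of test-level race described above: a plain mutable flag shared between domains, alongside the usual Atomic-based fix. The names (running, racy, fixed) and the structure are illustrative only; the actual code in domain_parallel_spawn_burn.ml may differ.

    (* Hypothetical sketch, not the actual test code: a plain mutable flag
       shared between two domains without synchronization is a data race;
       an Atomic.t is not. *)

    (* Racy variant: the main domain writes [running] while the spawned
       domain keeps reading it, with no synchronization in between. TSan
       reports this when the two accesses actually overlap at run time. *)
    let racy () =
      let running = ref true in
      let worker =
        Domain.spawn (fun () -> while !running do Domain.cpu_relax () done)
      in
      running := false;  (* unsynchronized write racing with the reads *)
      Domain.join worker

    (* Fixed variant: the flag is an Atomic.t, so the concurrent accesses
       are well-defined and TSan has nothing to report. *)
    let fixed () =
      let running = Atomic.make true in
      let worker =
        Domain.spawn (fun () ->
          while Atomic.get running do Domain.cpu_relax () done)
      in
      Atomic.set running false;
      Domain.join worker

    let () = racy (); fixed ()

Compiled with a TSan-enabled compiler (a switch built with --enable-tsan), the racy variant would be expected to produce a report naming the conflicting read and write, which is exactly the kind of output this CI job surfaces.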

@OlivierNicole
Contributor Author

Does the Jenkins CI not show stderr? Sometimes the logs do not include TSan reports although there should be one, e.g. the recent https://ci.inria.fr/ocaml/job/sanitizers/2025/execution/node/16/log/?consoleFull
