-
-
Notifications
You must be signed in to change notification settings - Fork 30.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
test_syslog: test_syslog_threaded() crashs randomly on ARM64 macOS 3.x #98178
Comments
It's running macOS 11.6 / ARM64 (Aarch64): Apple M1 CPU. test.pythoninfo:
|
The crash started to occur at build 2780 which was triggered by b399115 : no idea if it's related to this change. |
It's unclear to me if the 3.11 branch is affected on this buildbot worker (ARM64 macOS). There was a single build on the 3.11 branch in the last 2 days (since the crash started to occur on the main branch): https://buildbot.python.org/all/#/builders/1030 I don't see any crash on the 3.11 branch for one month. |
Something else that might be useful to note is that both crashes you linked to were running https://buildbot.python.org/all/#/builders/725/builds/2780/steps/5/logs/stdio (line 297)
https://buildbot.python.org/all/#/builders/725/builds/2786/steps/5/logs/stdio (line 627)
Which doesnt seem to be the case for other non-crashing builds (ie https://buildbot.python.org/all/#/builders/725/builds/2785/steps/5/logs/stdio) Could these be related? |
b399115 is not the first crash. See https://buildbot.python.org/all/#/builders/725. First crash in recent times was 2 days ago https://buildbot.python.org/all/#/builders/725/builds/2780 during testing of commit d876528. |
I'm unable to reproduce neither with pydebug nor with an opt build, neither with -F nor with re-starting -m test thousands of times. Now I'm thinking this has to be an interdependence between different tests. The obvious candidates are test_thread* but I'm somehow thinking this will be test_logging, test_asyncio, or test_concurrent_futures. Trying with all those to find the culprit right now... no dice yet. |
Doesn't seem to be multiprocessing (although one of those runs is deadlocked): 1:00:09 load avg: 26.97 [358] test_syslog passed -- running: test_multiprocessing_spawn (32.1 sec), test_multiprocessing_spawn (51 min 18 sec), test_multiprocessing_spawn (1 min 4 sec), test_multiprocessing_spawn (1 min 11 sec), test_multiprocessing_spawn (30.3 sec) |
A crash log from the buildbot would be nice, that should contain a stack trace of the crash. BTW. cae7d1d recently added more syslog tests. That commit also has some refcounting fixes in _testcapimodule.c, which look sane to me. |
Also doesn't look like it's concurrent.futures related... 1:19:15 load avg: 10.72 [378] test_syslog passed -- running: test_concurrent_futures (1 min 5 sec), test_concurrent_futures (41.6 sec) I'm in the process of getting the crash log from the box for you, Ronald. |
Do we know for sure that the libc |
Indeed. This issue mentions ASL handles cannot be used concurrently on two threads and the syslog implementation uses a single global ASL handle. The most recent code dump from Apple uses a lock to create the ASL handle, but doesn't guard usage of said handle. This may well be an OS bug. Which begs the question: Is this something we want to work around by introducing a lock on our end, even if that isn't a 100% fix because this wouldn't guard syslog() calls from C code. If not, disabling the multi-threading test in test_syslog would likely avoid the crash by not calling syslog concurrently on our end. Sigh... |
@ronaldoussoren unfortunately it seems that buildbot python.exe crashes don't produce any output in our buildbot's |
BTW. I haven't been able to reproduce the issue on my end, but haven't tried tested on macOS 11 yet (the version used on the buildbot). |
I believe it is the case that those crash reports aren't produced by macOS for non-gui programs, i.e if Python is not running under an app bundle. A Python "framework build" fixes that. |
er, if so, that would need to be running from an installed framework build, so something like |
We already have such kind of lock: Modules/socketmodule.c uses:
gethostbyname() and getaddrinfo() are no thread-safe on some platforms. |
Another simple option is to not release the GIL on macOS: I created PR #98213 for that. |
I prefer this option. |
Another crash: https://buildbot.python.org/all/#/builders/725/builds/2794 |
I can reproduce this easily on the buidbot with
🤷 But OTOH i managed to get a crash file by crashing it by hand:
|
Ok, after activating developer mode I can lldb:
Here are the frames before entering the
|
I can confirm that calling the function with the GIL held fixes the problem. |
Calling with the GIL held fixes the crash for concurrent calls from Python, but that's the best we can do. I want to try to create a simple reproducer in C and file a issue with Apple about this (not that I expect this to have any effect :-( ). With some luck this might also tell us if newer versions of the OS fix the issue, which would allow us to drop the workaround later on. |
On macOS, fix a crash in syslog.syslog() in multi-threaded applications. On macOS, the libc syslog() function is not thread-safe, so syslog.syslog() no longer releases the GIL to call it. (cherry picked from commit d4b9166) Co-authored-by: Victor Stinner <[email protected]>
On macOS, fix a crash in syslog.syslog() in multi-threaded applications. On macOS, the libc syslog() function is not thread-safe, so syslog.syslog() no longer releases the GIL to call it.
On macOS, fix a crash in syslog.syslog() in multi-threaded applications. On macOS, the libc syslog() function is not thread-safe, so syslog.syslog() no longer releases the GIL to call it. (cherry picked from commit d4b9166) Co-authored-by: Victor Stinner <[email protected]>
Well, for me it's quite clear the macOS syslog implementation is not thread-safe, and so using a lock (like the GIL) in Python is the right solution to make the code correct.
The source code can be found in Apple syslog source code:
But frame locall variables are not included in the trace, and I don't know how to get the C line numbre from the trace ( _asl_evaluate_send() uses a lock, but later, not at the beginning of the function which calls asl_msg_lookup(). I cannot find anything like a mutex or a lock in asl_msg.c. 2019, Someone else hitting a similar syslog() crash on macOS with threads: rigetti/rpcq#92 (comment)
asl manual page: https://developer.apple.com/library/archive/documentation/System/Conceptual/ManPages_iPhoneOS/man3/asl.3.html
|
The Python
The syslog C API doesn't accept any "log handle": it's stored somewhere by the C library between openlog() and syslog() calls. The unit test is a stress test for multi-threading: it calls Python syslog.openlog() and syslog.syslog() functions in a loop from many threads in parallel. Since the GIL is released, magic happens :-) threads = [threading.Thread(target=opener)]
threads += [threading.Thread(target=logger) for k in range(10)]
with threading_helper.start_threads(threads):
start.set()
time.sleep(0.1)
stop = True It seems like these threads only work for 0.1 second. If you want to make the crash more likely, replace Maybe sometimes, if openlog() and syslog() are called at the same time from two threads, there is a race condition in syslog() and there is a crash. It's a little bit weird that openlog() and syslog() don't have any kind of locking since the API is somehow "state less": no "state" is passed to syslog(). The internal |
On macOS, fix a crash in syslog.syslog() in multi-threaded applications. On macOS, the libc syslog() function is not thread-safe, so syslog.syslog() no longer releases the GIL to call it. (cherry picked from commit d4b9166) Co-authored-by: Victor Stinner <[email protected]>
On macOS, fix a crash in syslog.syslog() in multi-threaded applications. On macOS, the libc syslog() function is not thread-safe, so syslog.syslog() no longer releases the GIL to call it. (cherry picked from commit d4b9166) Co-authored-by: Victor Stinner <[email protected]>
Looking at the code I linked to earlier the problem is a race condition between calls to openlog and syslog inside libc.
What I don't understand at all is why someone took the effort of adding a mutex to guard changing the handle in openlog, but didn't add a similar guard when using the handle later on. In normal code your very unlikely to hit this race condition because applications generally only call openlog once, but still. |
My PR #98238 got merged, so I close the issue. If someone wants to fix/enhance macOS, I suggest you opening a bug report on the macOS side instead ;-) |
I tried that on macOS 12.6 and couldn't reproduce with 1:17:52 load avg: 61.52 [868] test_syslog passed (30.8 sec) I even tried to set the sleep to 1:34:06 load avg: 150.95 [14513] test_syslog passed Next I'll try my old Intel laptop with macOS 10.15. |
It looks like the issue is fixed in macOS 12, the latest sources are on GitHub instead of opensource.apple.com and manage the ASL handle more sanely: https://github.com/apple-oss-distributions/syslog/blob/syslog-392.100.2/libsystem_asl.tproj/src/syslog.c In this version the ASL handle used in syslog() is a new reference instead of a borrowed reference (to use Python terminology). That explains why we couldn't reproduce the issue on macOS 12 systems. |
In that case, the Python syslog module should hold or not the GIL depending on the macOS version. On macOS, the Python select module checks at runtime if poll() is broken or not, to workaround another macOS issue:
|
* main: (38 commits) pythongh-98251: Allow venv to pass along PYTHON* variables to pip and ensurepip when they do not impact path resolution (pythonGH-98259) Bpo-41246: IOCP Proactor avoid callback code duplication (python#21399) bpo-46364: Use sockets for stdin of asyncio only on AIX (python#30596) pythongh-98178: syslog() is not thread-safe on macOS (python#98213) Mark all targets in `Doc/Makefile` as `PHONY` (pythonGH-98189) pythongh-97982: Factorize PyUnicode_Count() and unicode_count() code (python#98025) pythongh-96265: Formatting changes for faq/general (python#98129) tutorial: remove "with single quotes" (python#98204) pythongh-97669: Remove Tools/scripts/startuptime.py (python#98214) signalmodule.c uses _PyErr_WriteUnraisableMsg() (python#98217) pythongh-97669: Fix test_tools reference leak (python#98216) pythongh-97669: Create Tools/patchcheck/ directory (python#98186) pythongh-65046: Link to logging cookbook from asyncio docs (python#98207) Formatting fixes in contextlib docs (python#98111) pythongh-95276: Add callable entry to the glossary (python#95738) pythongh-96130: Rephrase use of "typecheck" verb for clarity (python#98144) Fix some incorrect indentation around the main switch (python#98177) pythongh-98172: Fix formatting in `except*` docs (python#98173) pythongh-97982: Remove asciilib_count() (python#98164) pythongh-95756: Free and NULL-out code caches when needed (pythonGH-98181) ...
* main: (37 commits) pythongh-98251: Allow venv to pass along PYTHON* variables to pip and ensurepip when they do not impact path resolution (pythonGH-98259) Bpo-41246: IOCP Proactor avoid callback code duplication (python#21399) bpo-46364: Use sockets for stdin of asyncio only on AIX (python#30596) pythongh-98178: syslog() is not thread-safe on macOS (python#98213) Mark all targets in `Doc/Makefile` as `PHONY` (pythonGH-98189) pythongh-97982: Factorize PyUnicode_Count() and unicode_count() code (python#98025) pythongh-96265: Formatting changes for faq/general (python#98129) tutorial: remove "with single quotes" (python#98204) pythongh-97669: Remove Tools/scripts/startuptime.py (python#98214) signalmodule.c uses _PyErr_WriteUnraisableMsg() (python#98217) pythongh-97669: Fix test_tools reference leak (python#98216) pythongh-97669: Create Tools/patchcheck/ directory (python#98186) pythongh-65046: Link to logging cookbook from asyncio docs (python#98207) Formatting fixes in contextlib docs (python#98111) pythongh-95276: Add callable entry to the glossary (python#95738) pythongh-96130: Rephrase use of "typecheck" verb for clarity (python#98144) Fix some incorrect indentation around the main switch (python#98177) pythongh-98172: Fix formatting in `except*` docs (python#98173) pythongh-97982: Remove asciilib_count() (python#98164) pythongh-95756: Free and NULL-out code caches when needed (pythonGH-98181) ...
test_syslog crashed 4 times in the last 2 days, each time in the test_syslog_threaded() function, on the "ARM64 macOS 3.x" buildbot:
Example of crash:
This crash is on the
syslog.syslog()
call in a thread. Code:The text was updated successfully, but these errors were encountered: