Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

libflux: drop child watchers and the FLUX_REACTOR_SIGCHLD flag #6543

Merged
merged 8 commits into from
Jan 9, 2025

Conversation

garlick
Copy link
Member

@garlick garlick commented Jan 8, 2025

This drops child watchers and the FLUX_REACTOR_SIGCHLD flag from the public API, as discussed in #6512. The assumption is that if flux users are managing subprocesses they should be using the subprocess API or making the documented rexec calls to the broker.

Use a SIGCHLD signal watcher in libsubprocess instead of child watchers. This was the only known user.

Besides simplifying the public API, this eliminates a roadblock to swapping out the internal libev for libuv.

Problem: several subprocess unit tests check for file descriptor
leaks but don't account for file descriptors opened by the
reactor itself at runtime, such as for signalfd(2).

Move fd sampling period to enclose reactor creation/destruction.
Copy link
Contributor

@grondo grondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! This LGTM!

@garlick
Copy link
Member Author

garlick commented Jan 9, 2025

Pushing a fix for this ASAN failure:

2025-01-08T15:17:27.8865523Z FAIL: test_subprocess.t 253 - 5 file descriptors are open (expected 3-4)
2025-01-08T15:17:27.8865944Z #   Failed test '5 file descriptors are open (expected 3-4)'
2025-01-08T15:17:27.8866240Z #   at test/subprocess.c line 1127.
2025-01-08T15:17:27.8866485Z # test_fdcleanup posix-spawn
2025-01-08T15:17:27.8866942Z ok 254 - flux_cmd_create
2025-01-08T15:17:27.8867166Z PASS: test_subprocess.t 254 - flux_cmd_create
2025-01-08T15:17:27.8867436Z ok 255 - flux_local_exec posix-spawn
2025-01-08T15:17:27.8867720Z PASS: test_subprocess.t 255 - flux_local_exec posix-spawn
2025-01-08T15:17:27.8868138Z # stdout: 3
2025-01-08T15:17:27.8868361Z ok 256 - subprocess state == EXITED in completion handler
2025-01-08T15:17:27.8868781Z PASS: test_subprocess.t 256 - subprocess state == EXITED in completion handler
2025-01-08T15:17:27.8869396Z ok 257 - subprocess status is valid
2025-01-08T15:17:27.8869864Z PASS: test_subprocess.t 257 - subprocess status is valid
2025-01-08T15:17:27.8870366Z ok 258 - subprocess exit code is 0, got 0
2025-01-08T15:17:27.8870899Z PASS: test_subprocess.t 258 - subprocess exit code is 0, got 0
2025-01-08T15:17:27.8871455Z ok 259 - flux_reactor_run returned zero status
2025-01-08T15:17:27.8872012Z PASS: test_subprocess.t 259 - flux_reactor_run returned zero status
2025-01-08T15:17:27.8872385Z ok 260 - completion callback called 1 time
2025-01-08T15:17:27.8872707Z PASS: test_subprocess.t 260 - completion callback called 1 time
2025-01-08T15:17:27.8873209Z ok 261 - 3 file descriptors are open (expected 3-4)
2025-01-08T15:17:27.8873584Z PASS: test_subprocess.t 261 - 3 file descriptors are open (expected 3-4)
2025-01-08T15:17:27.8873923Z ok 262 - no file descriptors leaked
2025-01-08T15:17:27.8874200Z PASS: test_subprocess.t 262 - no file descriptors leaked
2025-01-08T15:17:27.8874463Z 1..262
2025-01-08T15:17:27.8874648Z # Looks like you failed 1 test of 262 run.
2025-01-08T15:17:27.8874924Z ERROR: test_subprocess.t - exited with status 1
2025-01-08T15:17:27.8875521Z ##[endgroup]
2025-01-08T15:17:27.8876434Z ##[error]ERROR: test_iostress.t - missing test plan
2025-01-08T15:17:27.8878184Z ##[error]ERROR: test_iostress.t - exited with status 1
2025-01-08T15:17:27.8879730Z ##[error]AddressSanitizer: heap-use-after-free on address 0x615000013990 at pc 0x00000041e4b7 bp 0x7ffff57ff790 sp 0x7ffff57ff788
2025-01-08T15:17:27.8880767Z ##[group]./src/common/libsubprocess/test_iostress.log
2025-01-08T15:17:27.8881065Z =================================================================
2025-01-08T15:17:27.8881594Z ==53000==ERROR: AddressSanitizer: heap-use-after-free on address 0x615000013990 at pc 0x00000041e4b7 bp 0x7ffff57ff790 sp 0x7ffff57ff788
2025-01-08T15:17:27.8882128Z WRITE of size 4 at 0x615000013990 thread T1
2025-01-08T15:17:27.8882470Z     #0 0x41e4b6 in sigchld_cb /usr/src/src/common/libsubprocess/local.c:457
2025-01-08T15:17:27.8882889Z     #1 0x41e1f9 in sigchld_cb /usr/src/src/common/libsubprocess/sigchld.c:103
2025-01-08T15:17:27.8883318Z     #2 0x7ffff785759f in ev_invoke_pending /usr/src/src/common/libev/ev.c:3770
2025-01-08T15:17:27.8883714Z     #3 0x7ffff7864cb4 in ev_run /usr/src/src/common/libev/ev.c:4190
2025-01-08T15:17:27.8884075Z     #4 0x7ffff7864cb4 in ev_run /usr/src/src/common/libev/ev.c:4021
2025-01-08T15:17:27.8884481Z     #5 0x7ffff77ad409 in flux_reactor_run /usr/src/src/common/libflux/reactor.c:92
2025-01-08T15:17:27.8884847Z     #6 0x4086e9 in test_server test/rcmdsrv.c:65
2025-01-08T15:17:27.8885189Z     #7 0x40880c in thread_wrapper /usr/src/src/common/libtestutil/util.c:51
2025-01-08T15:17:27.8885580Z     #8 0x7ffff7441e2c in start_thread (/lib64/libc.so.6+0x8ce2c)
2025-01-08T15:17:27.8886142Z     #9 0x7ffff74c67d3 in __clone (/lib64/libc.so.6+0x1117d3)
2025-01-08T15:17:27.8886663Z 
2025-01-08T15:17:27.8887038Z 0x615000013990 is located 144 bytes inside of 456-byte region [0x615000013900,0x615000013ac8)
2025-01-08T15:17:27.8887707Z freed by thread T1 here:
2025-01-08T15:17:27.8888269Z     #0 0x7ffff79d5368 in __interceptor_free.part.0 (/usr/lib64/libasan.so.8+0xb9368)
2025-01-08T15:17:27.8889055Z     #1 0x4145bc in subprocess_free /usr/src/src/common/libsubprocess/subprocess.c:151
2025-01-08T15:17:27.8889512Z     #2 0x7ffff785759f in ev_invoke_pending /usr/src/src/common/libev/ev.c:3770
2025-01-08T15:17:27.8889959Z     #3 0x7ffff7864cb4 in ev_run /usr/src/src/common/libev/ev.c:4190
2025-01-08T15:17:27.8890330Z     #4 0x7ffff7864cb4 in ev_run /usr/src/src/common/libev/ev.c:4021
2025-01-08T15:17:27.8890731Z     #5 0x7ffff77ad409 in flux_reactor_run /usr/src/src/common/libflux/reactor.c:92
2025-01-08T15:17:27.8891183Z     #6 0x4086e9 in test_server test/rcmdsrv.c:65
2025-01-08T15:17:27.8891754Z     #7 0x40880c in thread_wrapper /usr/src/src/common/libtestutil/util.c:51
2025-01-08T15:17:27.8892148Z     #8 0x7ffff7441e2c in start_thread (/lib64/libc.so.6+0x8ce2c)
2025-01-08T15:17:27.8892381Z 
2025-01-08T15:17:27.8892477Z previously allocated by thread T1 here:
2025-01-08T15:17:27.8892794Z     #0 0x7ffff79d6077 in calloc (/usr/lib64/libasan.so.8+0xba077)
2025-01-08T15:17:27.8893229Z     #1 0x4146cb in subprocess_create /usr/src/src/common/libsubprocess/subprocess.c:168
2025-01-08T15:17:27.8893735Z     #2 0x416bdd in flux_local_exec_ex /usr/src/src/common/libsubprocess/subprocess.c:511
2025-01-08T15:17:27.8894201Z     #3 0x40fef3 in server_exec_cb /usr/src/src/common/libsubprocess/server.c:404
2025-01-08T15:17:27.8894647Z     #4 0x7ffff77b11c8 in call_handler /usr/src/src/common/libflux/msg_handler.c:321
2025-01-08T15:17:27.8895125Z     #5 0x7ffff77b1ba2 in dispatch_message /usr/src/src/common/libflux/msg_handler.c:357
2025-01-08T15:17:27.8895592Z     #6 0x7ffff77b1ba2 in handle_cb /usr/src/src/common/libflux/msg_handler.c:450
2025-01-08T15:17:27.8896025Z     #7 0x7ffff785759f in ev_invoke_pending /usr/src/src/common/libev/ev.c:3770
2025-01-08T15:17:27.8896418Z     #8 0x7ffff7864cb4 in ev_run /usr/src/src/common/libev/ev.c:4190
2025-01-08T15:17:27.8897005Z     #9 0x7ffff7864cb4 in ev_run /usr/src/src/common/libev/ev.c:4021
2025-01-08T15:17:27.8897418Z     #10 0x7ffff77ad409 in flux_reactor_run /usr/src/src/common/libflux/reactor.c:92
2025-01-08T15:17:27.8897785Z     #11 0x4086e9 in test_server test/rcmdsrv.c:65
2025-01-08T15:17:27.8910192Z     #12 0x40880c in thread_wrapper /usr/src/src/common/libtestutil/util.c:51
2025-01-08T15:17:27.8910644Z     #13 0x7ffff7441e2c in start_thread (/lib64/libc.so.6+0x8ce2c)
2025-01-08T15:17:27.8910894Z 
2025-01-08T15:17:27.8910981Z Thread T1 created by T0 here:
2025-01-08T15:17:27.8911354Z     #0 0x7ffff79673e6 in __interceptor_pthread_create (/usr/lib64/libasan.so.8+0x4b3e6)
2025-01-08T15:17:27.8911851Z     #1 0x40901f in test_server_create /usr/src/src/common/libtestutil/util.c:177
2025-01-08T15:17:27.8912213Z     #2 0x40877b in rcmdsrv_create test/rcmdsrv.c:85
2025-01-08T15:17:27.8912496Z     #3 0x406080 in main test/iostress.c:366
2025-01-08T15:17:27.8912832Z     #4 0x7ffff73de54f in __libc_start_call_main (/lib64/libc.so.6+0x2954f)
2025-01-08T15:17:27.8913083Z 
2025-01-08T15:17:27.8913389Z SUMMARY: AddressSanitizer: heap-use-after-free /usr/src/src/common/libsubprocess/local.c:457 in sigchld_cb
2025-01-08T15:17:27.8913862Z Shadow bytes around the buggy address:
2025-01-08T15:17:27.8914182Z   0x0c2a7fffa6e0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
2025-01-08T15:17:27.8914559Z   0x0c2a7fffa6f0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
2025-01-08T15:17:27.8914925Z   0x0c2a7fffa700: fd fd fd fd fd fd fd fd fd fd fd fd fd fa fa fa
2025-01-08T15:17:27.8915275Z   0x0c2a7fffa710: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
2025-01-08T15:17:27.8915859Z   0x0c2a7fffa720: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
2025-01-08T15:17:27.8916350Z =>0x0c2a7fffa730: fd fd[fd]fd fd fd fd fd fd fd fd fd fd fd fd fd
2025-01-08T15:17:27.8917123Z   0x0c2a7fffa740: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
2025-01-08T15:17:27.8917751Z   0x0c2a7fffa750: fd fd fd fd fd fd fd fd fd fa fa fa fa fa fa fa
2025-01-08T15:17:27.8918376Z   0x0c2a7fffa760: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
2025-01-08T15:17:27.8919215Z   0x0c2a7fffa770: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
2025-01-08T15:17:27.8919903Z   0x0c2a7fffa780: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
2025-01-08T15:17:27.8920589Z Shadow byte legend (one shadow byte represents 8 application bytes):
2025-01-08T15:17:27.8921163Z   Addressable:           00
2025-01-08T15:17:27.8921573Z   Partially addressable: 01 02 03 04 05 06 07 
2025-01-08T15:17:27.8922020Z   Heap left redzone:       fa
2025-01-08T15:17:27.8922401Z   Freed heap region:       fd
2025-01-08T15:17:27.8922773Z   Stack left redzone:      f1
2025-01-08T15:17:27.8923393Z   Stack mid redzone:       f2
2025-01-08T15:17:27.8923768Z   Stack right redzone:     f3
2025-01-08T15:17:27.8924140Z   Stack after return:      f5
2025-01-08T15:17:27.8924518Z   Stack use after scope:   f8
2025-01-08T15:17:27.8924886Z   Global redzone:          f9
2025-01-08T15:17:27.8925242Z   Global init order:       f6
2025-01-08T15:17:27.8925658Z   Poisoned by user:        f7
2025-01-08T15:17:27.8926043Z   Container overflow:      fc
2025-01-08T15:17:27.8926430Z   Array cookie:            ac
2025-01-08T15:17:27.8927027Z   Intra object redzone:    bb
2025-01-08T15:17:27.8927387Z   ASan internal:           fe
2025-01-08T15:17:27.8927751Z   Left alloca redzone:     ca
2025-01-08T15:17:27.8928115Z   Right alloca redzone:    cb
2025-01-08T15:17:27.8928475Z ==53000==ABORTING
2025-01-08T15:17:27.8928833Z ERROR: test_iostress.t - missing test plan
2025-01-08T15:17:27.8929315Z ERROR: test_iostress.t - exited with status 1

Copy link

codecov bot commented Jan 9, 2025

Codecov Report

Attention: Patch coverage is 83.33333% with 20 lines in your changes missing coverage. Please review.

Project coverage is 83.96%. Comparing base (02ad07c) to head (96df522).

Files with missing lines Patch % Lines
src/common/libsubprocess/sigchld.c 82.79% 16 Missing ⚠️
src/cmd/builtin/proxy.c 50.00% 1 Missing ⚠️
src/common/libsubprocess/local.c 75.00% 1 Missing ⚠️
src/common/libsubprocess/server.c 80.00% 1 Missing ⚠️
src/common/libsubprocess/subprocess.c 87.50% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6543      +/-   ##
==========================================
+ Coverage   83.66%   83.96%   +0.30%     
==========================================
  Files         523      515       -8     
  Lines       88086    86092    -1994     
==========================================
- Hits        73693    72291    -1402     
+ Misses      14393    13801     -592     
Files with missing lines Coverage Δ
src/broker/broker.c 77.87% <100.00%> (+0.76%) ⬆️
src/cmd/flux-start.c 84.21% <100.00%> (+0.28%) ⬆️
src/cmd/flux-terminus.c 82.18% <100.00%> (+0.21%) ⬆️
src/common/libflux/reactor.c 95.00% <100.00%> (-0.32%) ⬇️
src/common/libflux/watcher_wrap.c 97.03% <ø> (+0.65%) ⬆️
src/shell/shell.c 79.55% <100.00%> (+0.23%) ⬆️
src/cmd/builtin/proxy.c 78.28% <50.00%> (-0.65%) ⬇️
src/common/libsubprocess/local.c 84.51% <75.00%> (+1.25%) ⬆️
src/common/libsubprocess/server.c 80.42% <80.00%> (+0.04%) ⬆️
src/common/libsubprocess/subprocess.c 89.29% <87.50%> (+0.09%) ⬆️
... and 1 more

... and 185 files with indirect coverage changes

Problem: libsubprocess unit tests occasionally hang when run with
a reactor that uses EVFLAG_NOSIGMASK, which may be required.

Block SIGCHLD in the test server before spawning threads to avoid
it being delivered to the client thread occasionally.
Problem: child watchers are only used by libsubprocess, and probably
do not need to be in the public API since libsubprocess is provided
for managing child processes.  In addition, we are considering porting
Flux to libuv and libuv does not offer similar functionality.

Register a signal watcher for SIGCHLD within libsubprocess that
persists as long as there are subprocesses to monitor, and calls
waitpid(2) to consume all subprocess state changes.  Add a hash by pid
and allow subprocess objects to register a callback to receive these
changes for a given pid.

Have all subprocess users create reactors without FLUX_REACTOR_SIGCHLD,
as the default ev_loop registers a SIGCHLD watcher that conflicts with
this one.

Add EVFLAG_SIGNALFD to the non-default loop, as that flag was used with
reactors created with FLUX_REACTOR_SIGCHLD, and appears to be be required
to avoid sharness tests hanging randomly.
Problem: child watchers have no users, are likely not required in the
public API, and inhibit changing the internal reactor to libuv.

Drop them.

Fixes flux-framework#6512
Problem: the FLUX_REACTOR_SIGCHLD flag has no users.

Drop this flag.
Update unit test.
Problem: child watchers are removed from the public API but
man pages remain.

Drop man pages.
Problem: FLUX_REACTOR_SIGCHLD has been dropped from the public
API but it is still mentioned in the man page.

Drop it from the man page.
Problem: signals should generally only be handled in the main thread
of a multi-threaded program, but this is undocumented.

Add a note to the signal watcher man page.
@garlick
Copy link
Member Author

garlick commented Jan 9, 2025

Just did a bit of cleanup, simplifying the commit message of the main libsubprocess commit, and splitting out the libtestutil change to signal handling which is independent. I also added a note about signal watchers in multi-threaded programs to flux_signal_watcher_create(3).

@garlick
Copy link
Member Author

garlick commented Jan 9, 2025

Alright, thanks for the review. I'll set MWP.

@mergify mergify bot merged commit 350b7fc into flux-framework:master Jan 9, 2025
33 checks passed
@garlick garlick deleted the no_child_watcher branch January 9, 2025 21:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants