Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Thread race between qdr_delivery_anycast_propagate_CT and qdr_connection_process #1199

Closed
kgiusti opened this issue Sep 14, 2023 · 4 comments
Assignees
Milestone

Comments

@kgiusti
Copy link
Contributor

kgiusti commented Sep 14, 2023

60: WARNING: ThreadSanitizer: data race (pid=5272)
60:   Read of size 1 at 0x7b4c000a0c50 by thread T4:
60:     #0 qdr_connection_process /home/runner/work/skupper-router/skupper-router/skupper-router/src/router_core/connections.c:428 (skrouterd+0x4d439c)
60:     #1 AMQP_conn_wake_handler /home/runner/work/skupper-router/skupper-router/skupper-router/src/router_node.c:356 (skrouterd+0x51c767)
60:     #2 writable_handler.constprop.0 /home/runner/work/skupper-router/skupper-router/skupper-router/src/container.c:387 (skrouterd+0x5218bf)
60:     #3 qd_conn_event_batch_complete /home/runner/work/skupper-router/skupper-router/skupper-router/src/container.c:477 (skrouterd+0x48d4cb)
60:     #4 thread_run /home/runner/work/skupper-router/skupper-router/skupper-router/src/server.c:1159 (skrouterd+0x52c936)
60:     #5 _thread_init /home/runner/work/skupper-router/skupper-router/skupper-router/src/posix/threading.c:206 (skrouterd+0x4a88bd)
60: 
60:   Previous write of size 1 at 0x7b4c000a0c50 by thread T2:
60:     #0 qdr_delivery_anycast_propagate_CT /home/runner/work/skupper-router/skupper-router/skupper-router/src/router_core/delivery.c:851 (skrouterd+0x4e723e)
60:     #1 qdr_update_delivery_CT /home/runner/work/skupper-router/skupper-router/skupper-router/src/router_core/delivery.c:741 (skrouterd+0x4e723e)
60:     #2 router_core_thread /home/runner/work/skupper-router/skupper-router/skupper-router/src/router_core/router_core_thread.c:255 (skrouterd+0x4f7c80)
60:     #3 _thread_init /home/runner/work/skupper-router/skupper-router/skupper-router/src/posix/threading.c:206 (skrouterd+0x4a88bd)
60: 
60:   Location is heap block of size 448 at 0x7b4c000a0ac0 allocated by thread T4:
60:     #0 posix_memalign <null> (libtsan.so.2+0x3f988)
60:     #1 qd_alloc /home/runner/work/skupper-router/skupper-router/skupper-router/src/alloc_pool.c:344 (skrouterd+0x470f9a)
60:     #2 new_qdr_delivery_t /home/runner/work/skupper-router/skupper-router/skupper-router/src/router_core/delivery.c:24 (skrouterd+0x502ad3)
60:     #3 qdr_link_deliver_to /home/runner/work/skupper-router/skupper-router/skupper-router/src/router_core/transfer.c:80 (skrouterd+0x502ad3)
60:     #4 AMQP_rx_handler /home/runner/work/skupper-router/skupper-router/skupper-router/src/router_node.c:904 (skrouterd+0x51db3a)
60:     #5 do_receive /home/runner/work/skupper-router/skupper-router/skupper-router/src/container.c:219 (skrouterd+0x48ea85)
60:     #6 qd_container_handle_event /home/runner/work/skupper-router/skupper-router/skupper-router/src/container.c:730 (skrouterd+0x48ea85)
60:     #7 handle /home/runner/work/skupper-router/skupper-router/skupper-router/src/server.c:1105 (skrouterd+0x52aef3)
60:     #8 thread_run /home/runner/work/skupper-router/skupper-router/skupper-router/src/server.c:1143 (skrouterd+0x52c865)
60:     #9 _thread_init /home/runner/work/skupper-router/skupper-router/skupper-router/src/posix/threading.c:206 (skrouterd+0x4a88bd)
60: 
60:   Thread T4 'wrkr_1' (tid=5293, running) created by main thread at:
60:     #0 pthread_create <null> (libtsan.so.2+0x5f0e6)
60:     #1 sys_thread /home/runner/work/skupper-router/skupper-router/skupper-router/src/posix/threading.c:228 (skrouterd+0x4ac71a)
60:     #2 qd_server_run /home/runner/work/skupper-router/skupper-router/skupper-router/src/server.c:1514 (skrouterd+0x52d189)
60:     #3 main_process /home/runner/work/skupper-router/skupper-router/skupper-router/router/src/main.c:111 (skrouterd+0x42fb00)
60:     #4 main /home/runner/work/skupper-router/skupper-router/skupper-router/router/src/main.c:365 (skrouterd+0x429c8e)
60: 
60:   Thread T2 'core_thread' (tid=5284, running) created by main thread at:
60:     #0 pthread_create <null> (libtsan.so.2+0x5f0e6)
60:     #1 sys_thread /home/runner/work/skupper-router/skupper-router/skupper-router/src/posix/threading.c:228 (skrouterd+0x4ac71a)
60:     #2 qdr_core /home/runner/work/skupper-router/skupper-router/skupper-router/src/router_core/router_core.c:125 (skrouterd+0x4fc0cc)
60:     #3 qd_router_setup_late /home/runner/work/skupper-router/skupper-router/skupper-router/src/router_node.c:2321 (skrouterd+0x52d778)
60:     #4 ffi_call_unix64 <null> (libffi.so.8+0x7a05)
60:     #5 main_process /home/runner/work/skupper-router/skupper-router/skupper-router/router/src/main.c:101 (skrouterd+0x42faa2)
60:     #6 main /home/runner/work/skupper-router/skupper-router/skupper-router/router/src/main.c:365 (skrouterd+0x429c8e)
60: 
60: SUMMARY: ThreadSanitizer: data race /home/runner/work/skupper-router/skupper-router/skupper-router/src/router_core/connections.c:428 in qdr_connection_process
@kgiusti kgiusti self-assigned this Sep 14, 2023
@kgiusti
Copy link
Contributor Author

kgiusti commented Sep 15, 2023

@ted-ross @ganeshmurthy

I think this is relatively benign. The race is caused by the core thread setting the settled flag in a downstream delivery at the same moment the I/O thread is checking that delivery settled flag.

I think this is benign because the core will send a delivery update event to the I/O thread which will cause it to re-read the flag. Since "settled" is a latch (never goes from set to unset) this guarantees the downstream I/O thread will (eventually) see the flag's new state.

This code is old as the hills, so I'm at a loss as to explain why TSAN has started to complain about it now. It may be that some recent change removed a (totally unrelated) mutex that cause TSAN to see a flush. Dunno.

I'm thinking we suppress this. What do you all think?

@kgiusti
Copy link
Contributor Author

kgiusti commented Sep 15, 2023

Update: I can repro this race locally by running the system_tests_http1_over_tcp.Http1OverTcpEdge2EdgeTest in a loop.

The fickle finger 'o git-bisect points to the culprit as this commit

That commit does re-arrange the codepath that exhibits this race, but I don't see it actually adding a race. I think that moving the code around a probably the reason TSAN has started complaining.

Still OK with suppression.

@ganeshmurthy
Copy link
Contributor

ganeshmurthy commented Sep 18, 2023

because the core will send a delivery update event to the I/O thread which will cause it to re-read the flag

Yes. Say qdr_delivery_anycast_propagate_CT does indeed update the settled=true right after the I/O thread calls delivery_update_handler with settled=false, the core thread will eventually wake the connection and the I/O thread will again loop thru the updated_deliveries and call the delivery_update_handler with the correct value for settled (true), so this should be fine.

I am ok with this suppression.

@kgiusti
Copy link
Contributor Author

kgiusti commented Sep 18, 2023

Arg - sorry I failed to notice the real reason this race is now showing up.

We changed the name of the function.

The old function name is used in the suppression file! We've already suppressed this error awhile back. My bad.

kgiusti added a commit to kgiusti/skupper-router that referenced this issue Sep 18, 2023
@ganeshmurthy ganeshmurthy added this to the 2.5.0 milestone Nov 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants