GH-37796: [C++][Acero] Fix race condition caused by straggling input in the as-of-join node #37839

JerAguilon · 2023-09-23T02:02:44Z

Rationale for this change

What changes are included in this PR?

While asofjoining some large parquet datasets with many row groups, I ran into a deadlock that I described here: #37796. Copy pasting below for convenience:

1. The left hand side of the asofjoin completes and is matched with the right hand tables, so InputFinished proceeds as expected. So far so good.
2. The right hand table(s) of the join are a huge dataset scan. They're still streaming and can legally still call AsofJoinNode::InputReceived all they want (doc ref).
3. Each input batch is blindly pushed to the InputStates, which in turn defer to BackpressureHandlers to decide whether to pause inputs (code pointer).
4. If enough batches come in right after EndFromProcessThread is called, then we might exceed the high_threshold and tell the input node to pause via the BackpressureController.
5. At this point, the process thread has stopped for the asofjoiner, so the right hand table(s) won't be dequeued, meaning BackpressureController::Resume() will never be called. This causes a deadlock.

TLDR this is caused by a straggling input node being paused due to backpressure after the process thread has ended. And since every PauseInput needs a corresponding ResumeInput to exit gracefully, we deadlock.
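The sequence above can be modelled with a small single-threaded toy. This is only a sketch under stated assumptions: `ToySource`, `ToyInputQueue`, and the simplified threshold checks are hypothetical names invented for illustration, not the Acero classes, and the real handler logic may differ in detail.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical stand-in for an upstream source node; not an Acero class.
struct ToySource {
  bool paused = false;
  void Pause() { paused = true; }
  void Resume() { paused = false; }
};

// Toy input queue with simplified high/low backpressure thresholds.
class ToyInputQueue {
 public:
  ToyInputQueue(ToySource* source, std::size_t low, std::size_t high)
      : source_(source), low_(low), high_(high) {
    assert(low < high);
  }

  // Producer side: every batch is enqueued; reaching `high` pauses the source.
  void Push(int batch) {
    queue_.push_back(batch);
    if (queue_.size() >= high_) source_->Pause();
  }

  // Consumer side: only the process thread pops; draining to `low` resumes.
  bool TryPop() {
    if (queue_.empty()) return false;
    queue_.pop_front();
    if (queue_.size() <= low_) source_->Resume();
    return true;
  }

 private:
  ToySource* source_;
  std::size_t low_;
  std::size_t high_;
  std::deque<int> queue_;
};

// Models the straggler: the process thread has already exited, so nothing
// ever calls TryPop() and the pause is never undone.
bool StragglerLeavesSourcePaused() {
  ToySource source;
  ToyInputQueue queue(&source, /*low=*/1, /*high=*/4);
  for (int batch = 0; batch < 8 && !source.paused; ++batch) {
    queue.Push(batch);
  }
  return source.paused;
}
```

In this toy the only path back to a resumed source runs through `TryPop()`, which is why a pause that lands after the consumer has gone away is permanent.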

Turns out this is fairly easy to reproduce with small tables, if you make a slow input node composed of 1-row record batches with a synthetic delay.

My solution is to:

1. Create a ForceShutdown hook that puts the input nodes in a resumed state, and for good measure we call StopProducing.
2. Also for good measure, if batches arrive after the process thread exits, we short circuit and return OK. This is because InputReceived can be called an arbitrary number of times after StopProducing, so it makes sense to not enqueue useless batches.
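A rough sketch of how the two parts of the fix fit together, again as a toy rather than the actual patch. `ToySource`, `ToyJoinInput`, and `EndProcessing` are hypothetical names; only the ideas (force-resume then stop, and drop batches that arrive after processing has ended) come from the description above.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical stand-in for an upstream node; not the Acero ExecNode API.
struct ToySource {
  bool paused = false;
  bool stopped = false;
  void Pause() { paused = true; }
  void Resume() { paused = false; }
  void StopProducing() { stopped = true; }
};

class ToyJoinInput {
 public:
  ToyJoinInput(ToySource* source, std::size_t high)
      : source_(source), high_(high) {}

  // Returns true if the batch was enqueued, false if it was dropped.
  bool InputReceived(int batch) {
    // Short circuit: once processing has ended, extra batches are useless and
    // enqueueing them could trip backpressure with nobody left to drain.
    if (process_finished_.load()) return false;
    queue_.push_back(batch);
    if (queue_.size() >= high_) source_->Pause();
    return true;
  }

  // Called when processing ends: leave the source resumed, then stop it.
  void EndProcessing() {
    process_finished_.store(true);
    source_->Resume();
    source_->StopProducing();
  }

  std::size_t queued() const { return queue_.size(); }

 private:
  ToySource* source_;
  std::size_t high_;
  std::deque<int> queue_;
  std::atomic<bool> process_finished_{false};
};

// Exercises both windows: a pause just before shutdown, and a batch after it.
bool ShutdownLeavesSourceResumed() {
  ToySource source;
  ToyJoinInput input(&source, /*high=*/2);
  input.InputReceived(0);
  input.InputReceived(1);  // reaches the threshold, so the source is paused
  assert(source.paused);
  input.EndProcessing();                         // force-resume, then stop
  const bool dropped = !input.InputReceived(2);  // straggler after shutdown
  return !source.paused && source.stopped && dropped && input.queued() == 2;
}
```

The forced resume handles a pause that happened before shutdown, and the early return handles batches that show up afterwards; neither one alone covers both cases in this model.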

Are these changes tested?

Yes, I added a delay to the batches of one of the already-existing asofjoin backpressure tests. Checking out main, we get a timeout failure. With my changes, it passes.

I considered a more deterministic test, but I struggled to create callbacks in a way that wasn't invasive to the Asof implementation. The idea of using delays was inspired by things I saw in source_node_test.cc
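For readers unfamiliar with the delay approach, the idea can be sketched with a slow pull-style source. This is an illustrative toy only: `ToyDelayedSource` is a hypothetical helper, integers stand in for record batches, and it is not the generator used in the actual test.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical helper: hands out one "batch" at a time, sleeping before each
// one so that this input lags behind the others.
class ToyDelayedSource {
 public:
  ToyDelayedSource(std::vector<int> batches, std::chrono::milliseconds delay)
      : batches_(std::move(batches)), delay_(delay) {
    assert(delay.count() >= 0);
  }

  // Writes the next batch to *out and returns true, or returns false once the
  // stream is exhausted.
  bool Next(int* out) {
    if (next_ >= batches_.size()) return false;
    std::this_thread::sleep_for(delay_);  // synthetic slowness
    *out = batches_[next_++];
    return true;
  }

 private:
  std::vector<int> batches_;
  std::chrono::milliseconds delay_;
  std::size_t next_ = 0;
};
```

A delay-based test is inherently timing dependent, which matches the trade-off described above: it is less invasive than adding callbacks to the implementation, at the cost of determinism.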

JerAguilon · 2023-09-23T02:04:32Z

cpp/src/arrow/acero/asof_join_node.cc

+    // It may be unintuitive to call Resume() here, but this is to avoid a deadlock.
+    // Since acero's executor won't terminate if any one node is paused, we need to
+    // force resume the node before stopping production.
+    backpressure_control_->Resume();


Perhaps one thing to clarify is whether ResumeInput behaves idempotently? I.e., is it OK to always call resume, even though only some inputs hit this PauseInput race condition?

My perusal of source_node.cc tells me this is OK, but LMK if this is a poor assumption to make.
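To make the idempotency question concrete, here is one way a counter-stamped pause/resume scheme could make repeated or unmatched resumes harmless. This is an assumption-laden sketch: `ToyCounterSource` is hypothetical and is not a transcription of source_node.cc.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical source that accepts counter-stamped pause/resume requests and
// ignores any request older than the newest one it has seen.
class ToyCounterSource {
 public:
  void Pause(std::int32_t counter) {
    if (counter <= last_counter_) return;  // stale request, ignore
    last_counter_ = counter;
    paused_ = true;
  }

  void Resume(std::int32_t counter) {
    if (counter <= last_counter_) return;  // stale request, ignore
    last_counter_ = counter;
    paused_ = false;  // a no-op if the source was already running
  }

  bool paused() const { return paused_; }

 private:
  std::int32_t last_counter_ = 0;
  bool paused_ = false;
};

// Resuming a source that was never paused, or resuming twice, leaves it running.
bool ResumeIsSafeToRepeat() {
  ToyCounterSource never_paused;
  never_paused.Resume(1);

  ToyCounterSource raced;
  raced.Pause(1);
  assert(raced.paused());
  raced.Resume(2);
  raced.Resume(3);
  raced.Pause(1);  // a stale pause arriving late is ignored
  return !never_paused.paused() && !raced.paused();
}
```

Under this model, calling resume on every input at shutdown is safe whether or not that particular input hit the race.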

JerAguilon · 2023-09-25T14:13:05Z

cpp/src/arrow/acero/asof_join_node.cc

@@ -19,6 +19,7 @@

 #include <atomic>
 #include <condition_variable>
+#include <iostream>


Oops... Will remove

westonpace

This seems like a good idea to me.

@icexelloss @rtpsw do either of you want to take a look?

westonpace · 2023-09-25T18:46:11Z

cpp/src/arrow/acero/asof_join_node_test.cc

-
-    src_decls.emplace_back("source",
-                           SourceNodeOptions(config.schema, GetGen(config.batches)));
+    if (config.is_delayed) {


I assume this new option triggers the deadlock on the unfixed code?

westonpace · 2023-09-25T18:49:59Z

cpp/src/arrow/acero/asof_join_node.cc

+    // Since acero's executor won't terminate if any one node is paused, we need to
+    // force resume the node before stopping production.
+    backpressure_control_->Resume();
+    return input_->StopProducing();


So if I understand correctly, this means we will call StopProducing on all right hand side nodes once:

- The left hand side has finished
- The right hand side has caught up

If so, then I agree this is a valid thing to do.

Yep.

As an aside, I feel like a more invasive change could fix this issue in the general case. If a node (in this example asof join) has:

- Called output->InputFinished() AND
- Called output_->InputReceived for however many record batches it advertised on InputFinished

We should be able to shut down execution, even if the node's inputs:

- are paused or
- not done streaming
- haven't called InputFinished

But I think this is a more invasive change to exec_plan.h and might have some hairy issues that I'm not thinking of.
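The condition described in this aside could be expressed as a small predicate. This is purely a sketch of the idea: `ToyOutputProgress` and `OutputIsComplete` are hypothetical, and nothing like them is claimed to exist in exec_plan.h.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bookkeeping for what a node has sent downstream.
struct ToyOutputProgress {
  bool finished_announced = false;      // InputFinished was called on the output
  std::int64_t advertised_batches = 0;  // total promised in that call
  std::int64_t delivered_batches = 0;   // InputReceived calls made so far
};

// True once the node has nothing more to tell its output, regardless of
// whether its own inputs are paused, still streaming, or unfinished.
bool OutputIsComplete(const ToyOutputProgress& progress) {
  assert(progress.delivered_batches >= 0);
  if (!progress.finished_announced) return false;
  return progress.delivered_batches == progress.advertised_batches;
}
```

A plan-level check like this would sidestep per-node fixes, though as noted above it may run into issues that a toy predicate does not capture.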

icexelloss · 2023-09-27T14:42:39Z

This looks reasonable to me. Feel free to merge.

bkietz · 2023-09-27T21:05:28Z

cpp/src/arrow/acero/asof_join_node.cc

+    // InputReceived may be called after execution was finished. Pushing it to the
+    // InputState may cause the BackPressureController to pause the input, causing a
+    // deadlock


Suggested change

-    // InputReceived may be called after execution was finished. Pushing it to the
-    // InputState may cause the BackPressureController to pause the input, causing a
-    // deadlock
+    // InputReceived may be called after execution was finished. Pushing it to the
+    // InputState is unnecessary since we're done (and anyway may cause the
+    // BackPressureController to pause the input, causing a deadlock), so drop it.

Do we still deadlock with this short circuit but without ForceShutdown etc?

Yes, ForceShutdown is still necessary. There's nothing stopping this order of events:

1. We receive enough data to finish the as-of join.
2. Right before we finish processing and shut down the worker thread, lots of unneeded batches come in from input A. Input A pauses.
3. We shut down the thread, and input A can't be unpaused.

Put another way, ForceShutdown keeps us from deadlocking when we ingest unneeded data before the worker thread exits. And this block keeps us from deadlocking when we ingest unneeded data after the worker thread exits.

But your comment change suggestions sound good to me

JerAguilon · 2023-10-01T05:29:13Z

Clarifying comment for @bkietz added. Ready for more thoughts

JerAguilon · 2023-10-20T20:08:08Z

Forgive the ignorance - first time making a PR on arrow. There's no further action needed from me to merge, correct?

bkietz · 2023-10-23T13:22:17Z

Could you rebase to pick up the fix in #37867? I think CI should be green after that.

Co-authored-by: Benjamin Kietzman <[email protected]>

JerAguilon · 2023-10-23T18:48:56Z

Sadly there are some seemingly unrelated failures: TestS3FS.GetFileInfoGeneratorStress and arrow-threading-utility-test

bkietz · 2023-10-24T13:18:10Z

CI failures seem unrelated. I'll merge. Thanks for working on this!

conbench-apache-arrow · 2023-10-25T12:56:22Z

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit e3d6b9b.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.

…input in the as-of-join node (apache#37839)

### Rationale for this change

### What changes are included in this PR?

While asofjoining some large parquet datasets with many row groups, I ran into a deadlock that I described here: apache#37796. Copy pasting below for convenience:

1. The left hand side of the asofjoin completes and is matched with the right hand tables, so `InputFinished` proceeds as [expected](https://github.com/apache/arrow/blob/2455bc07e09cd5341d1fabdb293afbd07682f0b2/cpp/src/arrow/acero/asof_join_node.cc#L1323). So far so good
2. The right hand table(s) of the join are a huge dataset scan. They're still streaming and can legally still call `AsofJoinNode::InputReceived` all they want ([doc ref](https://arrow.apache.org/docs/cpp/api/acero.html#_CPPv4N5arrow5acero8ExecNode13InputReceivedEP8ExecNode9ExecBatch))
3. Each input batch is blindly pushed to the `InputState`s, which in turn defer to `BackpressureHandler`s to decide whether to pause inputs. ([code pointer](https://github.com/apache/arrow/blob/2455bc07e09cd5341d1fabdb293afbd07682f0b2/cpp/src/arrow/acero/asof_join_node.cc#L1689))
4. If enough batches come in right after `EndFromProcessThread` is called, then we might exceed the [high_threshold](https://github.com/apache/arrow/blob/2455bc07e09cd5341d1fabdb293afbd07682f0b2/cpp/src/arrow/acero/asof_join_node.cc#L575) and tell the input node to pause via the [BackpressureController](https://github.com/apache/arrow/blob/2455bc07e09cd5341d1fabdb293afbd07682f0b2/cpp/src/arrow/acero/asof_join_node.cc#L540)
5. At this point, the process thread has stopped for the asofjoiner, so the right hand table(s) won't be dequeued, meaning `BackpressureController::Resume()` will never be called. This causes a [deadlock](https://arrow.apache.org/docs/cpp/api/acero.html#_CPPv4N5arrow5acero19BackpressureControl5PauseEv)

TLDR this is caused by a straggling input node being paused due to backpressure _after_ the process thread has ended. And since every `PauseInput` needs a corresponding `ResumeInput` to exit gracefully, we deadlock.

Turns out this is fairly easy to reproduce with small tables, if you make a slow input node composed of 1-row record batches with a synthetic delay.

My solution is to:

1. Create a `ForceShutdown` hook that puts the input nodes in a resumed state, and for good measure we call `StopProducing`
2. Also for good measure, if nodes come after the process thread exits, we short circuit and return OK. This is because `InputReceived` can be called an arbitrary number of times after `StopProducing`, so it makes sense to not enqueue useless batches.

### Are these changes tested?

Yes, I added a delay to the batches of one of the already-existing asofjoin backpressure tests. Checking out `main`, we get a timeout failure. With my changes, it passes.

I considered a more deterministic test, but I struggled to create callbacks in a way that wasn't invasive to the Asof implementation. The idea of using delays was inspired by things I saw in `source_node_test.cc`

### Are there any user-facing changes?

No

* Closes: apache#37796

Lead-authored-by: Jeremy Aguilon <[email protected]>
Co-authored-by: Jeremy Aguilon <[email protected]>
Co-authored-by: Benjamin Kietzman <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>

raulcd · 2024-01-14T14:59:43Z

Closes: #37796

JerAguilon requested a review from westonpace as a code owner September 23, 2023 02:02

JerAguilon changed the title ~~GH-37796: [C++][Acero] Fix race condition caused by straggling input.~~ GH-37796: [C++][Acero] Fix race condition caused by straggling input in the as-of-join node Sep 23, 2023

github-actions bot added the awaiting review Awaiting review label Sep 23, 2023

github-actions bot added the Component: C++ label Sep 23, 2023

github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Sep 25, 2023

westonpace approved these changes Sep 25, 2023

github-actions bot added awaiting merge Awaiting merge and removed awaiting committer review Awaiting committer review labels Sep 25, 2023

bkietz requested changes Sep 27, 2023

github-actions bot added awaiting changes Awaiting changes awaiting change review Awaiting change review and removed awaiting merge Awaiting merge awaiting changes Awaiting changes labels Sep 27, 2023

bkietz approved these changes Oct 2, 2023

github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels Oct 2, 2023

JerAguilon requested review from assignUser, kou, raulcd, paleolimbot, thisisnic, lidavidm and kevingurney as code owners October 23, 2023 14:59

Jeremy Aguilon and others added 6 commits October 23, 2023 12:58

.

d24b102

.

78c6d79

.

fb3a1f4

Update asof_join_node.cc

72ec897

Co-authored-by: Benjamin Kietzman <[email protected]>

Update asof_join_node.cc

ecfb84d

Co-authored-by: Benjamin Kietzman <[email protected]>

Update asof_join_node.cc

be5cb05

JerAguilon force-pushed the fix-asof branch from 5a225a9 to be5cb05 Compare October 23, 2023 16:58

github-actions bot removed Component: R Component: Java Component: Parquet Component: JavaScript Component: C# Component: Gandiva Component: MATLAB Component: Documentation labels Oct 23, 2023

trigger GitHub actions

529e3b5

bkietz merged commit e3d6b9b into apache:main Oct 24, 2023
32 of 33 checks passed

bkietz removed the awaiting merge Awaiting merge label Oct 24, 2023

JerAguilon deleted the fix-asof branch October 25, 2023 14:41

JerAguilon restored the fix-asof branch October 25, 2023 14:41

JerAguilon deleted the fix-asof branch October 25, 2023 14:41

JerAguilon restored the fix-asof branch October 25, 2023 14:42

trxcllnt removed their request for review November 8, 2023 16:58
