GH-39582: [C++][Acero] Increase size of Acero TempStack #40007

stenlarsson · 2024-02-08T19:50:51Z

We have had problems for a long time with a specific batch job that combines data from different sources. There is something in the data causing an Acero execution plan to hang or crash at random. The problem has been reproduced since Arrow 11.0.0, originally in Ruby, but it has also been reproduced in Python. There is unfortunately no test case that reliably reproduces the issue in a release build.

However, in a debug build we can see that the batch job causes an overflow on the temp stack in arrow/cpp/src/arrow/compute/util.cc:38. Increasing the size of the stack created in the Acero QueryContext works around the issue, but a real fix should be investigated separately.

This PR contains a "Critical Fix".

Closes: [C++][Acero] Random hangs when joining tables with ExecutePlan #39582

github-actions · 2024-02-08T19:51:22Z

⚠️ GitHub issue #39582 has been automatically assigned in GitHub to PR creator.

westonpace · 2024-02-09T22:27:27Z

Hmm, this temp stack is only (I think) used in the hash-join. It's basically a stack allocator. Allocations made on the allocator should be RAII guarded so they are released when they finish. So either there is a bug and these are leaking somehow (I feel like this would be more reproducible), or maybe some of these stack variables are being held across a future boundary and that's causing some re-entrancy which causes the stack to run out of space (I like this explanation because it seems like it's environment specific, and threading issues can often be environment specific). Or maybe it's only caused by certain input data?

Either way, this stack doesn't take up very much memory in the grand scheme of things. I don't think there is much harm in increasing this value.
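To make the two ideas above concrete (a stack allocator with RAII-guarded allocations, and nested use exhausting it), here is a minimal sketch. `BumpStack`, `ScopedAlloc`, and `NestedFramesFit` are hypothetical names made up for illustration; this is not Arrow's `TempVectorStack`, only an assumption about the general shape of such an allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stack allocator (not Arrow's TempVectorStack).
class BumpStack {
 public:
  explicit BumpStack(std::size_t capacity) : buffer_(capacity), top_(0) {}

  // Returns nullptr when the request does not fit in the remaining space.
  std::uint8_t* Push(std::size_t size) {
    if (size > buffer_.size() - top_) return nullptr;
    std::uint8_t* ptr = buffer_.data() + top_;
    top_ += size;
    return ptr;
  }

  // Allocations must be released in reverse (LIFO) order.
  void Pop(std::size_t size) {
    assert(size <= top_);
    top_ -= size;
  }

  std::size_t used() const { return top_; }

 private:
  std::vector<std::uint8_t> buffer_;
  std::size_t top_;
};

// RAII guard: the allocation is released when the guard leaves scope.
class ScopedAlloc {
 public:
  ScopedAlloc(BumpStack* stack, std::size_t size)
      : stack_(stack), size_(size), data_(stack->Push(size)) {}
  ~ScopedAlloc() {
    if (data_ != nullptr) stack_->Pop(size_);
  }
  ScopedAlloc(const ScopedAlloc&) = delete;
  ScopedAlloc& operator=(const ScopedAlloc&) = delete;

  bool ok() const { return data_ != nullptr; }
  std::uint8_t* data() { return data_; }

 private:
  BumpStack* stack_;
  std::size_t size_;
  std::uint8_t* data_;
};

// Each level keeps its allocation alive while calling the next level, so
// usage grows with nesting depth. Returns false once a frame no longer fits.
bool NestedFramesFit(BumpStack* stack, int depth, std::size_t bytes_per_frame) {
  if (depth == 0) return true;
  ScopedAlloc frame(stack, bytes_per_frame);
  if (!frame.ok()) return false;
  return NestedFramesFit(stack, depth - 1, bytes_per_frame);
}
```

With a 64-byte stack, four nested 16-byte frames fit and a fifth does not, even though nothing leaks: every frame is released on the way out. Unexpected re-entrancy would have the same effect as extra nesting depth.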

westonpace

I'll approve this. @kou, feel free to merge if you want to go ahead with this to unblock Ruby.

If we can come up with a reproducible case I can investigate further but I probably won't have time to dedicate to trying to trigger this anytime soon.

kou · 2024-02-10T12:08:23Z

Thanks for reviewing this!

OK. Let's merge this for now. Let's investigate this later. (I hope that this can be reproduced in my environment...)

@stenlarsson Before we merge this, could you update the PR description to describe the problem in more depth, like the associated issue does, and to note that this is not a real fix and that we should investigate and fix this later? We will use the PR title and description for the commit message.

pitrou · 2024-02-12T16:23:35Z

Regardless of this, can we please check for stack overflows instead of letting them hang the process?

kou · 2024-02-13T00:54:54Z

Ah, it's a good idea.
We can work on it in this PR or a separate PR.

stenlarsson · 2024-02-13T07:02:04Z

I have updated the description.

It is unfortunate that you cannot reproduce the issue, because in a debug build the assertion fails reliably every time on my computer.

pitrou · 2024-02-13T08:58:21Z

> We can work on it in this PR or a separate PR.

Let's do it in this PR?

zanmato1984 · 2024-02-14T11:44:43Z

Another crash caused by this stack overflow has been reported in #39951.

stenlarsson · 2024-02-16T06:19:30Z

> Regardless of this, can we please check for stack overflows instead of letting them hang the process?

What exactly should happen in case of a stack overflow?

pitrou · 2024-02-16T08:53:00Z

A regular error if possible, or at least a controlled abort rather than memory corruption.
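As an illustration of the two options named above, here is a minimal sketch of a bounds check that either reports a regular error or aborts in a controlled way. `AllocStatus`, `CheckAlloc`, and `CheckAllocOrAbort` are hypothetical names, and the signatures are assumptions made for this example rather than the API of `TempVectorStack`.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <string>

// Hypothetical minimal status type, used only for this sketch.
struct AllocStatus {
  bool ok;
  std::string message;
};

// Option 1: a regular error. Validates the request before the stack top is
// moved, so an oversized request is never written past the buffer.
AllocStatus CheckAlloc(std::int64_t top, std::int64_t request, std::int64_t capacity) {
  if (request < 0 || top < 0 || top > capacity || request > capacity - top) {
    return {false, "temp stack overflow: requested " + std::to_string(request) +
                       " bytes, top=" + std::to_string(top) +
                       ", capacity=" + std::to_string(capacity)};
  }
  return {true, ""};
}

// Option 2: a controlled abort. The process stops with a message instead of
// continuing with corrupted memory.
void CheckAllocOrAbort(std::int64_t top, std::int64_t request, std::int64_t capacity) {
  AllocStatus st = CheckAlloc(top, request, capacity);
  if (!st.ok) {
    std::fprintf(stderr, "%s\n", st.message.c_str());
    std::abort();
  }
}
```

Unlike a debug-only assertion, both variants stay active in release builds, which is the point of the request.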

stenlarsson · 2024-02-22T15:13:09Z

Returning an error status is not an option since the TempVectorStack::alloc function is used in the TempVectorHolder constructor, so I decided to throw a std::runtime_error exception. I couldn't find any examples of throwing exceptions in the compute module, so please let me know if there is a better way to abort.

pitrou · 2024-02-22T15:29:28Z

You could either create a static constructor:

template <typename T>
class TempVectorHolder {
  friend class TempVectorStack;

 public:
  static Result<TempVectorHolder> Make(TempVectorStack* stack, uint32_t num_elements) {
    TempVectorHolder holder{stack, nullptr, 0, num_elements};
    ARROW_RETURN_NOT_OK(stack->alloc(num_elements * sizeof(T), &holder.data_, &holder.id_));
    return holder;
  }

or, conversely, move the typed allocation API into TempVectorStack:

class ARROW_EXPORT TempVectorStack {
  template <typename>
  friend class TempVectorHolder;

 public:
  template <typename T>
  Result<TempVectorHolder<T>> AllocateVector(uint32_t num_elements) {
    TempVectorHolder<T> holder{this, nullptr, 0, num_elements};
    ARROW_RETURN_NOT_OK(alloc(num_elements * sizeof(T), &holder.data_, &holder.id_));
    return holder;
  }

stenlarsson · 2024-02-22T15:47:48Z

Can you really return a TempVectorHolder like that? I think that is beyond my limited C++ skills unfortunately.

pitrou · 2024-02-22T15:50:05Z

It's probably possible, yes. It's just a bunch of pointers and integers.

stenlarsson · 2024-02-22T15:59:21Z

If I understand this correctly, the purpose of the TempVectorHolder is to release the memory in the destructor, as if the vector was allocated on the stack. How can you return such an object?

pitrou · 2024-02-22T16:11:16Z

By defining a move constructor and assignment operator, like this:

template <typename T>
class TempVectorHolder {
  friend class TempVectorStack;

 public:
  ~TempVectorHolder() {
    if (stack_) {
      stack_->release(id_, num_elements_ * sizeof(T));
    }
  }
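  TempVectorHolder& operator=(TempVectorHolder&& other) {
    if (this != &other) {
      // Release any allocation this holder already owns before taking over other's.
      if (stack_) {
        stack_->release(id_, num_elements_ * sizeof(T));
      }
      stack_ = other.stack_;
      other.stack_ = NULLPTR;  // the moved-from holder must not release again
      data_ = other.data_;
      other.data_ = NULLPTR;
      id_ = other.id_;
      num_elements_ = other.num_elements_;
    }
    return *this;
  }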
  TempVectorHolder(TempVectorHolder&& other) {
    *this = std::move(other);
  }

  T* mutable_data() { return reinterpret_cast<T*>(data_); }

 private:
  TempVectorStack* stack_ = NULLPTR;
  uint8_t* data_;
  int id_;
  uint32_t num_elements_;
};
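To show that a holder which releases in its destructor can still be returned from a factory, here is a self-contained sketch of the same move-only pattern. `ByteBudget` and `Lease` are hypothetical stand-ins for the stack and the holder, and `std::optional` is used in place of Arrow's `Result` purely to keep the example free of dependencies.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <utility>

// Hypothetical stand-in for the temp stack: only tracks bytes in use.
struct ByteBudget {
  std::size_t capacity;
  std::size_t used = 0;
};

// Move-only holder: exactly one live object owns each reservation.
class Lease {
 public:
  // Factory that reports failure through the return value instead of throwing.
  static std::optional<Lease> Make(ByteBudget* budget, std::size_t size) {
    if (size > budget->capacity - budget->used) return std::nullopt;
    budget->used += size;
    return Lease(budget, size);
  }

  Lease(Lease&& other) noexcept : budget_(other.budget_), size_(other.size_) {
    other.budget_ = nullptr;  // the moved-from object no longer owns anything
  }
  Lease& operator=(Lease&& other) noexcept {
    if (this != &other) {
      Release();
      budget_ = other.budget_;
      size_ = other.size_;
      other.budget_ = nullptr;
    }
    return *this;
  }
  Lease(const Lease&) = delete;
  Lease& operator=(const Lease&) = delete;
  ~Lease() { Release(); }

  std::size_t size() const { return size_; }

 private:
  Lease(ByteBudget* budget, std::size_t size) : budget_(budget), size_(size) {}

  // Gives the bytes back once; safe to call on a moved-from object.
  void Release() {
    if (budget_ != nullptr) {
      budget_->used -= size_;
      budget_ = nullptr;
    }
  }

  ByteBudget* budget_;
  std::size_t size_;
};
```

The temporary created inside `Make` is moved into the returned `std::optional`, and its destructor then sees a null owner and does nothing, so the reservation is released only once, by whichever object ends up holding it.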

stenlarsson · 2024-02-23T10:38:48Z

I tried to implement this, but it is unfortunately beyond my abilities. If the TempVectorHolder returns a status, so does every method using it, and all methods using those methods, and so on. It is a huge change, and I got lost along the way.

Raising an exception will have to do.
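The ripple effect described above can be sketched in a few lines: once the innermost allocation returns a status, every caller above it has to change its return type and forward the error. `Status`, `AllocScratch`, `HashRows`, and `ProbeBatch` are hypothetical names for illustration and do not correspond to actual Acero functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>

// Hypothetical minimal status type, used only for this sketch.
struct Status {
  bool ok;
  std::string message;
  static Status OK() { return Status{true, ""}; }
  static Status Error(std::string msg) { return Status{false, std::move(msg)}; }
};

// Innermost level: the allocation itself can now fail.
Status AllocScratch(std::size_t bytes, std::size_t available) {
  if (bytes > available) {
    return Status::Error("scratch allocation of " + std::to_string(bytes) +
                         " bytes does not fit");
  }
  return Status::OK();
}

// A caller that used to return void must now return Status as well...
Status HashRows(std::size_t num_rows, std::size_t available) {
  Status st = AllocScratch(num_rows * sizeof(std::uint32_t), available);
  if (!st.ok) return st;
  // ... hashing work would go here ...
  return Status::OK();
}

// ...and so must its caller, and so on up the call chain.
Status ProbeBatch(std::size_t num_rows, std::size_t available) {
  Status st = HashRows(num_rows, available);
  if (!st.ok) return st;
  return Status::OK();
}
```

Three functions are manageable; in a code base where many helpers allocate temporary vectors, the same signature change repeats at every level, which is why the change grows so large.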

Certain Acero execution plans can cause an overflow of the TempVectorStack initialized by the QueryContext, and increasing the size of the stack fixes the problem. I don't know exactly what causes the overflow, so I haven't written a test for it. Fixes apache#39582.

pitrou · 2024-02-26T15:57:39Z

Ok, fair enough. I've now turned the exception into a regular check.

pitrou · 2024-02-26T15:58:13Z

@github-actions crossbow submit -g cpp

github-actions · 2024-02-26T16:00:38Z

Revision: ec3fd3b

Submitted crossbow builds: ursacomputing/crossbow @ actions-fe3111b10f

Task	Status
test-alpine-linux-cpp
test-build-cpp-fuzz
test-conda-cpp
test-conda-cpp-valgrind
test-cuda-cpp
test-debian-11-cpp-amd64
test-debian-11-cpp-i386
test-fedora-39-cpp
test-ubuntu-20.04-cpp
test-ubuntu-20.04-cpp-bundled
test-ubuntu-20.04-cpp-minimal-with-formats
test-ubuntu-20.04-cpp-thread-sanitizer
test-ubuntu-22.04-cpp
test-ubuntu-22.04-cpp-20
test-ubuntu-22.04-cpp-no-threading

zanmato1984 · 2024-02-26T16:20:01Z

I'd post a non-binding +1.

pitrou · 2024-02-26T16:25:51Z

Thank you @zanmato1984 !

conbench-apache-arrow · 2024-02-27T01:26:12Z

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 9a7662b.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them.

…#40007)

We have had problems for a long time with a specific batch job that combines data from different sources. There is something in the data causing an Acero execution plan to hang or crash at random. The problem has been reproduced since Arrow 11.0.0, originally in Ruby, but it has also been reproduced in Python. There is unfortunately no test case that reliably reproduces the issue in a release build.

However, in a debug build we can see that the batch job causes an overflow on the temp stack in arrow/cpp/src/arrow/compute/util.cc:38. Increasing the size of the stack created in the Acero QueryContext works around the issue, but a real fix should be investigated separately.

**This PR contains a "Critical Fix".**

* Closes: apache#39582

Lead-authored-by: Sten Larsson <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>

stenlarsson requested a review from westonpace as a code owner February 8, 2024 19:50

stenlarsson mentioned this pull request Feb 8, 2024

[C++][Acero] Random hangs when joining tables with ExecutePlan #39582

Closed

github-actions bot added the Component: C++ and awaiting review labels Feb 8, 2024

westonpace approved these changes Feb 9, 2024

github-actions bot added the awaiting merge label and removed the awaiting review label Feb 9, 2024

zanmato1984 mentioned this pull request Feb 14, 2024

[C++] Segmentation fault in hash-join/swiss-join #39951

Closed

stenlarsson and others added 3 commits February 26, 2024 16:48

Throw error on overflow in TempVectorStack::alloc

33e353d

Turn exception into a controlled abort with error message

ec3fd3b

pitrou force-pushed the main branch from 3f5a13c to ec3fd3b Compare February 26, 2024 15:57

pitrou approved these changes Feb 26, 2024

pitrou merged commit 9a7662b into apache:main Feb 26, 2024
33 of 35 checks passed

pitrou removed the awaiting merge label Feb 26, 2024

ZhangHuiGui mentioned this pull request Mar 9, 2024

[C++] Crashed at TempStack alloc when use Hashing32::HashBatch independently #40431

Closed

zanmato1984 mentioned this pull request Apr 22, 2024

[C++][Acero] Acero's shared (per-thread) temp vector stack usage may overflow #41334

Closed

zanmato1984 mentioned this pull request May 1, 2024

GH-41334: [C++][Acero] Use per-node basis temp vector stack to mitigate overflow #41335

Merged
