
GH-39582: [C++][Acero] Increase size of Acero TempStack #40007

Merged
merged 3 commits on Feb 26, 2024

Conversation

stenlarsson
Contributor

@stenlarsson commented Feb 8, 2024

We have had problems for a long time with a specific batch job that combines data from different sources. Something in the data causes an Acero execution plan to hang or crash at random. The problem has been reproducible since Arrow 11.0.0, originally in Ruby, but it has also been reproduced in Python. Unfortunately there is no test case that reliably reproduces the issue in a release build.

However, in a debug build we can see that the batch job causes an overflow on the temp stack in arrow/cpp/src/arrow/compute/util.cc:38. Increasing the size of the stack created in the Acero QueryContext works around the issue, but a real fix should be investigated separately.

This PR contains a "Critical Fix".


github-actions bot commented Feb 8, 2024

⚠️ GitHub issue #39582 has been automatically assigned in GitHub to PR creator.

@westonpace
Member

Hmm, this temp stack is only (I think) used in the hash-join. It's basically a stack allocator. Allocations made on the allocator should be RAII-guarded so they are released when they finish. So either there is a bug and these are leaking somehow (I feel like this would be more reproducible), or some of these stack variables are being held across a future boundary and that is causing some re-entrancy which makes the stack run out of space (I like this explanation because it seems environment specific, and threading issues can often be environment specific). Or maybe it's only caused by certain input data?

Either way, this stack doesn't take up very much memory in the grand scheme of things. I don't think there is much harm in increasing this value.
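(For context, a bump-style scratch stack with RAII-guarded allocations, as described above, might look roughly like the sketch below. All names here are invented for illustration; this is not Arrow's actual TempVectorStack code.)

#include <cassert>
#include <cstdint>
#include <vector>

// Simplified illustration of a scratch stack with RAII-guarded allocations.
// Invented names; not Arrow's actual TempVectorStack.
class ScratchStack {
 public:
  explicit ScratchStack(std::size_t capacity) : buffer_(capacity) {}

  // RAII guard: acquires a slice of the stack on construction and
  // releases it (in LIFO order) on destruction.
  class Guard {
   public:
    Guard(ScratchStack* stack, std::size_t size)
        : stack_(stack), size_(size), data_(stack->Push(size)) {}
    ~Guard() { stack_->Pop(size_); }
    std::uint8_t* data() { return data_; }

   private:
    ScratchStack* stack_;
    std::size_t size_;
    std::uint8_t* data_;
  };

 private:
  std::uint8_t* Push(std::size_t size) {
    // This is the kind of overflow check that fires in a debug build when
    // too many guards are alive at once (e.g. held across a future boundary).
    assert(top_ + size <= buffer_.size());
    std::uint8_t* out = buffer_.data() + top_;
    top_ += size;
    return out;
  }
  void Pop(std::size_t size) { top_ -= size; }

  std::vector<std::uint8_t> buffer_;
  std::size_t top_ = 0;
};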

Member

@westonpace left a comment


I'll approve this. @kou can feel free to merge if he wants to go ahead with this to unblock Ruby.

If we can come up with a reproducible case I can investigate further but I probably won't have time to dedicate to trying to trigger this anytime soon.

@github-actions bot added the "awaiting merge" label and removed the "awaiting review" label on Feb 9, 2024
@kou
Member

kou commented Feb 10, 2024

Thanks for reviewing this!

OK. Let's merge this for now and investigate later. (I hope that this can be reproduced in my environment...)

@stenlarsson Before we merge, could you update the PR description to describe the problem in more depth, like the associated issue does, and to note that this is not a real fix and should be investigated/fixed later? We will use the PR title and description for the commit message.

@pitrou
Member

pitrou commented Feb 12, 2024

Regardless of this, can we please check for stack overflows instead of letting them hang the process?

@kou
Member

kou commented Feb 13, 2024

Ah, that's a good idea.
We can work on it in this PR or a separate PR.

@stenlarsson
Contributor Author

I have updated the description.

It is unfortunate that you cannot reproduce the issue; in a debug build the assertion fails reliably every time on my computer.

@pitrou
Member

pitrou commented Feb 13, 2024

We can work on it in this PR or a separate PR.

Let's do it in this PR?

@zanmato1984
Contributor

Another crash caused by this stack overflow has been reported in #39951.

@stenlarsson
Contributor Author

Regardless of this, can we please check for stack overflows instead of letting them hang the process?

What exactly should happen in case of a stack overflow?

@pitrou
Member

pitrou commented Feb 16, 2024

A regular error if possible, or at least a controlled abort rather than memory corruption.

@stenlarsson
Contributor Author

Returning an error status is not an option, since the TempVectorStack::alloc function is used in the TempVectorHolder constructor, so I decided to throw a std::runtime_error exception. I couldn't find any examples of throwing exceptions in the compute module, so please let me know if there is a better way to abort.
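(For reference, the throwing variant described here would look roughly like the sketch below. Member names such as buffer_, top_, buffer_size_ and num_vectors_ and the EstimatePaddedSize helper are illustrative rather than the exact Arrow code, and <stdexcept> is required for std::runtime_error.)

void TempVectorStack::alloc(uint32_t num_bytes, uint8_t** data, int* id) {
  // EstimatePaddedSize is a hypothetical helper standing in for whatever
  // padding/guard-area bookkeeping the real implementation does.
  int64_t new_top = top_ + EstimatePaddedSize(num_bytes);
  if (new_top > buffer_size_) {
    // Cannot return a Status from here (called from a constructor),
    // so fail loudly instead of writing past the end of the buffer.
    throw std::runtime_error("TempVectorStack overflow");
  }
  *data = buffer_ + top_;
  *id = num_vectors_++;
  top_ = new_top;
}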

@pitrou
Member

pitrou commented Feb 22, 2024

You could either create a static constructor:

template <typename T>
class TempVectorHolder {
  friend class TempVectorStack;

 public:
  static Result<TempVectorHolder> Make(TempVectorStack* stack, uint32_t num_elements) {
    TempVectorHolder holder{stack, nullptr, 0, num_elements};
    ARROW_RETURN_NOT_OK(stack->alloc(num_elements * sizeof(T), &holder.data_, &holder.id_));
    return holder;
  }

or, conversely, move the typed allocation API into TempVectorStack:

class ARROW_EXPORT TempVectorStack {
  template <typename>
  friend class TempVectorHolder;

 public:
  template <typename T>
  Result<TempVectorHolder<T>> AllocateVector(uint32_t num_elements) {
    TempVectorHolder<T> holder{this, nullptr, 0, num_elements};
    ARROW_RETURN_NOT_OK(alloc(num_elements * sizeof(T), &holder.data_, &holder.id_));
    return holder;
  }

@stenlarsson
Contributor Author

Can you really return a TempVectorHolder like that? I think that is beyond my limited C++ skills unfortunately.

@pitrou
Member

pitrou commented Feb 22, 2024

It's probably possible, yes. It's just a bunch of pointers and integers.

@stenlarsson
Contributor Author

If I understand this correctly, the purpose of the TempVectorHolder is to release the memory in the destructor, as if the vector was allocated on the stack. How can you return such an object?

@pitrou
Member

pitrou commented Feb 22, 2024

By defining a move constructor and assignment operator, like this:

template <typename T>
class TempVectorHolder {
  friend class TempVectorStack;

 public:
  ~TempVectorHolder() {
    if (stack_) {
      stack_->release(id_, num_elements_ * sizeof(T));
    }
  }
  TempVectorHolder& operator=(TempVectorHolder&& other) {
    stack_ = other.stack_;
    other.stack_ = NULLPTR;
    data_ = other.data_;
    other.data_ = NULLPTR;
    id_ = other.id_;
    num_elements_ = other.num_elements_;
    return *this;
  }
  TempVectorHolder(TempVectorHolder&& other) {
    *this = std::move(other);
  }

  T* mutable_data() { return reinterpret_cast<T*>(data_); }

 private:
  TempVectorStack* stack_ = NULLPTR;
  uint8_t* data_;
  int id_;
  uint32_t num_elements_;
};
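(Note on the sketch above: the move operations null out stack_ in the moved-from holder, so its destructor becomes a no-op and the allocation is released exactly once, by whichever holder ends up owning it.)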

@stenlarsson
Contributor Author

I tried to implement this, but it is unfortunately beyond my abilities. If creating the TempVectorHolder returns a status, then so does every method using it, and every method using those methods, and so on. It is a huge change, and I got lost along the way.

Raising an exception will have to do.

stenlarsson and others added 3 commits February 26, 2024 16:48
Certain Acero execution plans can cause an overflow of the TempVectorStack initialized by the QueryContext, and increasing the size of the stack fixes the problem. I don't know exactly what causes the overflow, so I haven't written a test for it.

Fixes apache#39582.
@pitrou
Member

pitrou commented Feb 26, 2024

Ok, fair enough. I've now turned the exception into a regular check.
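(Illustratively, and reusing the hypothetical names from the earlier sketch, a "regular check" here could use Arrow's ARROW_CHECK macro from arrow/util/logging.h, roughly as below; this approximates the spirit of the change rather than quoting the merged diff.)

  // Abort with a clear message instead of corrupting memory on overflow.
  ARROW_CHECK(new_top <= buffer_size_)
      << "TempVectorStack exhausted: the temp stack in QueryContext is too small";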

@pitrou
Member

pitrou commented Feb 26, 2024

@github-actions crossbow submit -g cpp


Revision: ec3fd3b

Submitted crossbow builds: ursacomputing/crossbow @ actions-fe3111b10f

Task Status
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind Azure
test-cuda-cpp GitHub Actions
test-debian-11-cpp-amd64 GitHub Actions
test-debian-11-cpp-i386 GitHub Actions
test-fedora-39-cpp GitHub Actions
test-ubuntu-20.04-cpp GitHub Actions
test-ubuntu-20.04-cpp-bundled GitHub Actions
test-ubuntu-20.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-20.04-cpp-thread-sanitizer GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions

@zanmato1984
Contributor

I'd post a non-binding +1.

@pitrou
Member

pitrou commented Feb 26, 2024

Thank you @zanmato1984 !

@pitrou merged commit 9a7662b into apache:main on Feb 26, 2024
33 of 35 checks passed
@pitrou removed the "awaiting merge" label on Feb 26, 2024

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 9a7662b.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them.

zanmato1984 pushed a commit to zanmato1984/arrow that referenced this pull request Feb 28, 2024
thisisnic pushed a commit to thisisnic/arrow that referenced this pull request Mar 8, 2024
raulcd pushed a commit that referenced this pull request Mar 13, 2024
Successfully merging this pull request may close these issues:

[C++][Acero] Random hangs when joining tables with ExecutePlan