-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-39582: [C++][Acero] Increase size of Acero TempStack #40007
Conversation
|
Hmm, this temp stack is only (I think) used in the hash-join. It's basically a stack allocator. Allocations made on the allocator should be RAII guarded to release when they finish. So either there is a bug and these are leaking somehow (I feel like this would be more reproducible) or maybe some of these stack variables are being help across a future boundary and that's causing some re-entrancy which causes the stack to run out of space (I like this explanation because it seems like it's environment specific and threading issues can often be environment statistic). Or, maybe it's only caused by certain input data? Either way, this stack doesn't take up very much memory in the grand scheme of things. I don't think there is much harm in increasing this value. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll approve this. @kou can feel free to merge if he wants to go ahead with this to unblock ruby.
If we can come up with a reproducible case I can investigate further but I probably won't have time to dedicate to trying to trigger this anytime soon.
Thanks for reviewing this! OK. Let's merge this for now. Let's investigating this later. (I hope that this can be reproducible on my environment...) @stenlarsson Could you update the PR description to describe this problem more deeper like the associated issue, this is not a real fix and we should investigate/fix this later before we merge this? We will use the PR title and description for commit message. |
Regardless of this, can we please check for stack overflows instead of letting them hang the process? |
Ah, it's a good idea. |
I have updated the description. It is unfortunately that you cannot reproduce the issue, because in a debug build the assertion fails reliably every time on my computer. |
Let's do it in this PR? |
Another issue of crash caused by this stack overflow has been reported in #39951. |
What exactly should happen in case of a stack overflow? |
A regular error if possible, or at least a controlled abort rather than memory corruption. |
Returning an error status is not an option since the |
You could either create a static constructor: template <typename T>
class TempVectorHolder {
friend class TempVectorStack;
public:
static Result<TempVectorHolder> Make(TempVectorStack* stack, uint32_t num_elements) {
TempVectorHolder holder{stack, nullptr, 0, num_elements};
ARROW_RETURN_NOT_OK(stack->alloc(num_elements * sizeof(T), &holder.data_, &holder.id_));
return holder;
} or, conversely, move the typed allocation API into TempVectorStack: class ARROW_EXPORT TempVectorStack {
template <typename>
friend class TempVectorHolder;
public:
template <typename T>
Result<TempVectorHolder<T>> AllocateVector(uint32_t num_elements) {
TempVectorHolder holder{this, nullptr, 0, num_elements};
ARROW_RETURN_NOT_OK(alloc(num_elements * sizeof(T), &holder.data_, &holder.id_));
return holder;
} |
Can you really return a |
It's probably possible, yes. It's just a bunch of pointers and integers. |
If I understand this correctly, the purpose of the TempVectorHolder is to release the memory in the destructor, as if the vector was allocated on the stack. How can you return such an object? |
By defining a move constructor and assignment operator, like this: template <typename T>
class TempVectorHolder {
friend class TempVectorStack;
public:
~TempVectorHolder() {
if (stack_) {
stack_->release(id_, num_elements_ * sizeof(T));
}
}
TempVectorHolder& operator=(TempVectorHolder&& other) {
stack_ = other.stack_;
other.stack_ = NULLPTR;
data_ = other.data_;
other.data_ = NULLPTR;
id_ = other.id_;
num_elements_ = other.num_elements_;
return *this;
}
TempVectorHolder(TempVectorHolder&& other) {
*this = std::move(other);
}
T* mutable_data() { return reinterpret_cast<T*>(data_); }
private:
TempVectorStack* stack_ = NULLPTR;
uint8_t* data_;
int id_;
uint32_t num_elements_;
}; |
I tried to implement this, but it is unfortunately beyond my abilities. If the TempVectorHolder returns a status, so does every method using it, and all methods using those methods, and so on. It is a huge change, and I got lost along the way. Raising an exception will have to do. |
Certain Acero execution plans can cause an overflow of the TempVectorStack initialized by the QueryContext, and increasing the size of the stack fixes the problem. I don't know exactly what causes the overflow, so I haven't written a test for it. Fixes apache#39582.
Ok, fair enough. I've now turned the exception into a regular check. |
@github-actions crossbow submit -g cpp |
Revision: ec3fd3b Submitted crossbow builds: ursacomputing/crossbow @ actions-fe3111b10f |
I'd post a non-binding +1. |
Thank you @zanmato1984 ! |
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 9a7662b. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them. |
…#40007) We have had problems for a long time with a specific batch job that combines data from different sources. There is something in the data causing an Acero execution plan to hang or crash at random. The problem has been reproduced since Arrow 11.0.0, originally in Ruby, but it has also in Python. There is unfortunately no test case that reliably reproduces the issue in a release build. However, in a debug build we can see that the batch job causes an overflow on the temp stack in arrow/cpp/src/arrow/compute/util.cc:38. Increasing the size of the stack created in the Acero QueryContext works around the issue, but a real fix should be investigated separately. **This PR contains a "Critical Fix".** * Closes: apache#39582 Lead-authored-by: Sten Larsson <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
…#40007) We have had problems for a long time with a specific batch job that combines data from different sources. There is something in the data causing an Acero execution plan to hang or crash at random. The problem has been reproduced since Arrow 11.0.0, originally in Ruby, but it has also in Python. There is unfortunately no test case that reliably reproduces the issue in a release build. However, in a debug build we can see that the batch job causes an overflow on the temp stack in arrow/cpp/src/arrow/compute/util.cc:38. Increasing the size of the stack created in the Acero QueryContext works around the issue, but a real fix should be investigated separately. **This PR contains a "Critical Fix".** * Closes: apache#39582 Lead-authored-by: Sten Larsson <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
We have had problems for a long time with a specific batch job that combines data from different sources. There is something in the data causing an Acero execution plan to hang or crash at random. The problem has been reproduced since Arrow 11.0.0, originally in Ruby, but it has also in Python. There is unfortunately no test case that reliably reproduces the issue in a release build. However, in a debug build we can see that the batch job causes an overflow on the temp stack in arrow/cpp/src/arrow/compute/util.cc:38. Increasing the size of the stack created in the Acero QueryContext works around the issue, but a real fix should be investigated separately. **This PR contains a "Critical Fix".** * Closes: #39582 Lead-authored-by: Sten Larsson <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
We have had problems for a long time with a specific batch job that combines data from different sources. There is something in the data causing an Acero execution plan to hang or crash at random. The problem has been reproduced since Arrow 11.0.0, originally in Ruby, but it has also in Python. There is unfortunately no test case that reliably reproduces the issue in a release build.
However, in a debug build we can see that the batch job causes an overflow on the temp stack in arrow/cpp/src/arrow/compute/util.cc:38. Increasing the size of the stack created in the Acero QueryContext works around the issue, but a real fix should be investigated separately.
This PR contains a "Critical Fix".