[C++] Crashed at TempStack alloc when using Hashing32::HashBatch independently #40431
cc @kou, PTAL at this issue?
Can you provide buildable C++ code that reproduces this problem?
Of course:

```cpp
#include <arrow/compute/exec.h>
#include <arrow/compute/util.h>
#include <arrow/testing/gtest_util.h>
#include <arrow/testing/random.h>
#include <arrow/type_fwd.h>
#include <arrow/compute/light_array.h>
#include <arrow/compute/key_hash.h>
#include <arrow/util/async_util.h>
#include <arrow/util/future.h>
#include <arrow/util/task_group.h>
#include <arrow/util/thread_pool.h>
#include <arrow/util/logging.h>
#include <arrow/acero/options.h>
#include <arrow/compute/api_vector.h>
#include <arrow/memory_pool.h>
#include <arrow/record_batch.h>
#include <arrow/builder.h>
#include <arrow/result.h>
#include <arrow/array/diff.h>

#include <iostream>  // added: needed for std::cout below
#include <mutex>
#include <thread>
#include <unordered_map>

#include "gtest/gtest.h"

TEST(HashBatch, BasicTest) {
  auto arr = arrow::ArrayFromJSON(arrow::int32(), "[9,2,6]");
  const int batch_len = arr->length();
  arrow::compute::ExecBatch exec_batch({arr}, batch_len);
  auto ctx = arrow::compute::default_exec_context();
  arrow::util::TempVectorStack stack;
  // The stack is initialized with only batch_len * sizeof(uint32_t) bytes.
  ASSERT_OK(stack.Init(ctx->memory_pool(), batch_len * sizeof(uint32_t)));

  std::vector<uint32_t> hashes(batch_len);
  std::vector<arrow::compute::KeyColumnArray> temp_column_arrays;
  ASSERT_OK(arrow::compute::Hashing32::HashBatch(
      exec_batch, hashes.data(), temp_column_arrays,
      ctx->cpu_info()->hardware_flags(), &stack, 0, batch_len));

  for (int i = 0; i < batch_len; i++) {
    std::cout << hashes[i] << " ";
  }
}
```

cc @kou, do you think this problem needs to be solved?
Sorry, I missed this. Thanks. I could run the code:

```diff
diff --git a/cpp/src/arrow/compute/key_hash_test.cc b/cpp/src/arrow/compute/key_hash_test.cc
index c998df7169..ccfddaa645 100644
--- a/cpp/src/arrow/compute/key_hash_test.cc
+++ b/cpp/src/arrow/compute/key_hash_test.cc
@@ -311,5 +311,24 @@ TEST(VectorHash, FixedLengthTailByteSafety) {
   HashFixedLengthFrom(/*key_length=*/19, /*num_rows=*/64, /*start_row=*/63);
 }
 
+TEST(HashBatch, BasicTest) {
+  auto arr = arrow::ArrayFromJSON(arrow::int32(), "[9,2,6]");
+  const int batch_len = arr->length();
+  arrow::compute::ExecBatch exec_batch({arr}, batch_len);
+  auto ctx = arrow::compute::default_exec_context();
+  arrow::util::TempVectorStack stack;
+  ASSERT_OK(stack.Init(ctx->memory_pool(), batch_len * sizeof(uint32_t)));
+
+  std::vector<uint32_t> hashes(batch_len);
+  std::vector<arrow::compute::KeyColumnArray> temp_column_arrays;
+  ASSERT_OK(arrow::compute::Hashing32::HashBatch(
+      exec_batch, hashes.data(), temp_column_arrays,
+      ctx->cpu_info()->hardware_flags(), &stack, 0, batch_len));
+
+  for (int i = 0; i < batch_len; i++) {
+    std::cout << hashes[i] << " ";
+  }
+}
+
 } // namespace compute
 } // namespace arrow
```

In general, allocating only the required size is preferred. But it seems that this allocation needs more space than that:

arrow/cpp/src/arrow/compute/key_hash.cc, lines 387 to 395 (at 605f8a7)
And we also need 16 bytes of metadata for each allocation:

arrow/cpp/src/arrow/compute/util.cc, lines 35 to 44 (at 605f8a7)
Anyway, could you try this?
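For reference, a minimal sketch of how the reproducer's stack could be sized so that HashBatch's internal temp-vector allocations fit. It reuses `ctx` and the `ASSERT_OK`/`Init` calls from the test above; the budget of four temp vectors is an assumption for illustration, not an audit of HashBatch's internals.

```cpp
// Sketch only: size the TempVectorStack from util::MiniBatch::kMiniBatchLength
// instead of the batch length, and leave room for the 16-byte per-allocation
// metadata noted above. kNumTempVectors is an assumed upper bound.
constexpr int64_t kNumTempVectors = 4;     // assumption, not audited
constexpr int64_t kPerAllocMetadata = 16;  // see util.cc lines referenced above
const int64_t stack_size =
    kNumTempVectors *
    (arrow::util::MiniBatch::kMiniBatchLength * sizeof(uint32_t) + kPerAllocMetadata);

arrow::util::TempVectorStack stack;
ASSERT_OK(stack.Init(ctx->memory_pool(), stack_size));
```

Whether callers should have to oversize the stack like this, or HashBatch should size its temporaries by num_rows instead, is exactly the question raised in this issue.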
…ternal for prevent using by users (#40484)

### Rationale for this change

These files expose implementation details and APIs that are not meant for third-party use. This PR explicitly marks them internal, which also avoids having them installed.

### Are these changes tested?

By existing builds and tests.

### Are there any user-facing changes?

No, except hiding some header files that were not supposed to be included externally.

* GitHub Issue: #40431

Lead-authored-by: ZhangHuiGui <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
Issue resolved by pull request 40484.
Describe the bug, including details regarding any error messages, version, and platform.

The issue is similar to #40007, but they are different.

I want to use the `Hashing32::HashBatch` API to produce a hash array for a batch. Although `Hashing32` and `Hashing64` are used in the join-related code, they can also be used independently, as in the reproducer shown in the comments above.

The crash stack in `HashBatch` points at the TempStack allocation. The reason is the following code:

arrow/cpp/src/arrow/compute/key_hash.cc, lines 385 to 387 (at 7e286dd)

The holder uses `max_batch_size`, which is `1024`, as its `num_elements`; that is far more than the temp stack's initial `buffer_size`.
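To make the size mismatch concrete, here is a small standalone illustration. It is not Arrow code; the numbers come from the reproducer above (a 3-row int32 batch) and from the 1024-element mini-batch length and the 16-byte metadata figure mentioned in this issue.

```cpp
#include <cstdint>
#include <iostream>

// Standalone arithmetic for the mismatch described above (not Arrow code).
int main() {
  constexpr int64_t batch_len = 3;             // rows in the reproducer's batch
  constexpr int64_t mini_batch_length = 1024;  // util::MiniBatch::kMiniBatchLength
  constexpr int64_t metadata_bytes = 16;       // per-allocation overhead noted in this issue

  const int64_t stack_init_bytes = batch_len * sizeof(uint32_t);  // 12 bytes
  const int64_t temp_vector_bytes =
      mini_batch_length * sizeof(uint32_t) + metadata_bytes;      // 4112 bytes

  std::cout << "stack initialized with " << stack_init_bytes
            << " bytes, but a single temp vector needs about "
            << temp_vector_bytes << " bytes\n";
  return 0;
}
```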
I know that `HashBatch` is only used in hash-join and related code. For joins, the batches have already been clipped at the upper level, ensuring that each input batch size is less than or equal to `kMiniBatchLength` and that the stack is big enough. But `HashBatch` can be used independently. So maybe we could use `num_rows` rather than `util::MiniBatch::kMiniBatchLength` in the `HashBatch`-related APIs?
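A rough sketch of the shape that proposal could take, written as a hypothetical standalone helper rather than the actual key_hash.cc code (which is not quoted in this issue): size the per-call temporaries by the rows actually being hashed, capped at the mini-batch length.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper, for illustration only: choose the temp-vector length
// from num_rows instead of always using the mini-batch length.
constexpr int64_t kMiniBatchLength = 1024;  // mirrors util::MiniBatch::kMiniBatchLength

int64_t TempVectorLength(int64_t num_rows) {
  return std::min<int64_t>(num_rows, kMiniBatchLength);
}
```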
Component(s)

C++