GH-37895: [C++] Feature: support concatenate recordbatches. #37896
Conversation
cpp/src/arrow/record_batch.cc
Outdated
if (cols == 0) {
  // special case: null batch, no data, just length
  for (size_t i = 0; i < batches.size(); ++i) {
    length += batches[i]->num_rows();
  }
} else {
I don't think you need a special case for zero columns:
Suggested change:

- if (cols == 0) {
-   // special case: null batch, no data, just length
-   for (size_t i = 0; i < batches.size(); ++i) {
-     length += batches[i]->num_rows();
-   }
- } else {
+ // special case: null batch, no data, just length
+ for (size_t i = 0; i < batches.size(); ++i) {
+   length += batches[i]->num_rows();
+ }
Then the loop for (int col = 0; col < cols; ++col) will just never be entered if cols == 0.
No, my pseudocode is below, with the else branch:
if (cols == 0) {
  // special case: null batch, no data, just length
  for (size_t i = 0; i < batches.size(); ++i) {
    length += batches[i]->num_rows();
  }
} else {
  for (int col = 0; col < cols; ++col) {
    // ... concatenate each column ...
  }
}
The special case of zero columns is handled here to support count(*): such a RecordBatch stores no data, only the row count. The scenario is triggered when these small batches need to be merged.
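For illustration, a minimal sketch of such a length-only batch, assuming the stock Arrow C++ API (the construction in a real engine may differ):

#include <arrow/api.h>

// A zero-column RecordBatch: empty schema and no arrays, but a nonzero row
// count. count(*)-style plans only need this row count, not any column data.
std::shared_ptr<arrow::RecordBatch> MakeLengthOnlyBatch(int64_t num_rows) {
  auto empty_schema = arrow::schema(arrow::FieldVector{});
  return arrow::RecordBatch::Make(empty_schema, num_rows, arrow::ArrayVector{});
}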
Will the following change make any difference to the result?
Result<std::shared_ptr<RecordBatch>> ConcatenateRecordBatches(
    const RecordBatchVector& batches, MemoryPool* pool) {
  int64_t length = 0;
  size_t n = batches.size();
  if (n == 0) {
    return Status::Invalid("Must pass at least one recordbatch");
  }
  int cols = batches[0]->num_columns();
  auto schema = batches[0]->schema();
  for (int i = 0; i < batches.size(); ++i) {
    length += batches[i]->num_rows();
    if (!schema->Equals(batches[i]->schema())) {
      return Status::Invalid(
          "Schema of RecordBatch index ", i, " is ", batches[i]->schema()->ToString(),
          ", which does not match index 0 recordbatch schema: ", schema->ToString());
    }
  }
  std::vector<std::shared_ptr<Array>> concatenated_columns;
  for (int col = 0; col < cols; ++col) {
    ArrayVector column_arrays;
    for (const auto& batch : batches) {
      column_arrays.emplace_back(std::move(batch->column(col)));
    }
    ARROW_ASSIGN_OR_RAISE(auto concatenated_column, Concatenate(column_arrays, pool));
    concatenated_columns.emplace_back(std::move(concatenated_column));
  }
  return RecordBatch::Make(std::move(schema), length, std::move(concatenated_columns));
}
I assumed:
- Schema checking can be moved out of the nested loop.
- The result length is just the sum of all batch lengths, no matter how many (or zero) columns there are.
- Each column has the same length in a RecordBatch, so no checking of column length is needed here.
- Variable names like column_data, data, and columns are ambiguous, so I changed them to more descriptive names.
- More data can be std::moved to avoid copies.
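A quick usage sketch under the same assumptions (batch1 and batch2 are hypothetical std::shared_ptr<RecordBatch> values with identical schemas):

// Merge two batches into one; errors propagate via Result/Status.
RecordBatchVector batches = {batch1, batch2};
ARROW_ASSIGN_OR_RAISE(auto merged,
                      ConcatenateRecordBatches(batches, default_memory_pool()));
// merged->num_rows() == batch1->num_rows() + batch2->num_rows()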
The code looks very clean, but it does not handle the case of cols = 0; in that case, the length is not obtained.
A batch may have cols = 0 and an empty schema but still have a length. This kind of batch is used to implement count(*): it stores no actual columns, only the number of rows.
for (int i = 0; i < batches.size(); ++i) {
length += batches[i]->num_rows();
...
}
Doesn't this include the cols=0 case?
Sorry, you are correct, it is handled here
Thanks for the suggestion; the change has been submitted. Note one adjustment:

column_arrays.emplace_back(std::move(batch->column(col)));

was changed to

column_arrays.emplace_back(batch->column(col));
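For context, this is my reading of the Arrow headers rather than anything stated in the thread: RecordBatch::column(int) returns its std::shared_ptr<Array> by value, so the call already yields a temporary and the extra std::move is redundant:

// column(col) returns std::shared_ptr<Array> by value, so the returned
// temporary already binds as an rvalue; no std::move is needed.
column_arrays.emplace_back(batch->column(col));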
+1. Thanks! But it will need @bkietz's approval before merging.
I think the C++ Windows checks are failing due to this change but I couldn't figure out why...
+1, I am also very confused: macOS and Linux compile correctly, but Windows does not.
I found that ARROW_EXPORT is missing; this should be the reason.
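For reference, a sketch of what the fix presumably looks like in record_batch.h (the exact merged declaration may differ slightly):

// On Windows (MSVC), symbols are not exported from the Arrow DLL unless
// annotated; ARROW_EXPORT expands to the dllexport/dllimport attribute, so
// without it the new function links on Linux/macOS but fails on Windows.
ARROW_EXPORT
Result<std::shared_ptr<RecordBatch>> ConcatenateRecordBatches(
    const RecordBatchVector& batches, MemoryPool* pool = default_memory_pool());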
CI failure seems unrelated.
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit c5bce96. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 8 possible false positives for unstable benchmarks that are known to sometimes produce them.
* Closes: apache#37895

Authored-by: light-city <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
Rationale for this change
User scenario: when executing an Acero plan, many small batches may be produced by aggregation and hash-join nodes. In addition, in an MPP database the data is distributed across segments, and when there are many segments each segment's batches are small. To improve performance, we want to merge these fragmented small batches into one large batch and process them together.
What changes are included in this PR?
record_batch.cc
record_batch.h
record_batch_test.cc
Are these changes tested?
Yes, see record_batch_test.cc.
Are there any user-facing changes?
Yes.
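To illustrate the user-facing change, a minimal end-to-end sketch (MakeBatch and Demo are hypothetical helper names; the API is arrow::ConcatenateRecordBatches as added by this PR):

#include <arrow/api.h>

#include <iostream>

// Hypothetical helper: build a one-column Int64 batch from the given values.
arrow::Result<std::shared_ptr<arrow::RecordBatch>> MakeBatch(
    const std::vector<int64_t>& values) {
  arrow::Int64Builder builder;
  ARROW_RETURN_NOT_OK(builder.AppendValues(values));
  ARROW_ASSIGN_OR_RAISE(auto array, builder.Finish());
  auto schema = arrow::schema({arrow::field("x", arrow::int64())});
  return arrow::RecordBatch::Make(schema, array->length(), {array});
}

arrow::Status Demo() {
  ARROW_ASSIGN_OR_RAISE(auto a, MakeBatch({1, 2}));
  ARROW_ASSIGN_OR_RAISE(auto b, MakeBatch({3, 4, 5}));
  // The new API: merge fragmented small batches into one large batch.
  ARROW_ASSIGN_OR_RAISE(auto merged, arrow::ConcatenateRecordBatches({a, b}));
  std::cout << "rows: " << merged->num_rows() << std::endl;  // rows: 5
  return arrow::Status::OK();
}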