Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-37895: [C++] Feature: support concatenate recordbatches. #37896

Merged
merged 3 commits into from
Oct 13, 2023

Conversation

Light-City
Copy link
Contributor

@Light-City Light-City commented Sep 27, 2023

Rationale for this change

User scenario: When we use acero plan, many smaller batches may be generated through agg and hashjoin. In addition, due to the mpp database, there is data distribution. When there are many segments, each segment data is compared at this time. Small, in order to improve performance, we hope to merge multiple fragmented small batches into one large batch for calculation together.

What changes are included in this PR?

record_batch.cc
record_batch.h
record_batch_test.cc

Are these changes tested?

yes, see record_batch_test.cc

Are there any user-facing changes?

yes

@github-actions
Copy link

⚠️ GitHub issue #37895 has been automatically assigned in GitHub to PR creator.

cpp/src/arrow/record_batch.h Outdated Show resolved Hide resolved
cpp/src/arrow/record_batch.cc Outdated Show resolved Hide resolved
cpp/src/arrow/record_batch.cc Outdated Show resolved Hide resolved
cpp/src/arrow/record_batch.cc Outdated Show resolved Hide resolved
cpp/src/arrow/record_batch.cc Outdated Show resolved Hide resolved
Comment on lines 449 to 451
if (cols == 0) {
// special case: null batch, no data, just length
for (size_t i = 0; i < batches.size(); ++i) {
length += batches[i]->num_rows();
}
} else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need a special case for zero columns:

Suggested change
if (cols == 0) {
// special case: null batch, no data, just length
for (size_t i = 0; i < batches.size(); ++i) {
length += batches[i]->num_rows();
}
} else {
// special case: null batch, no data, just length
for (size_t i = 0; i < batches.size(); ++i) {
length += batches[i]->num_rows();
}

Then the loop for (int col = 0; col < cols; ++col) will just never be entered if cols == 0

Copy link
Contributor Author

@Light-City Light-City Sep 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, My pseudocode is below, with the else branch

if (cols == 0) {
    // special case: null batch, no data, just length
    for (size_t i = 0; i < batches.size(); ++i) {
      length += batches[i]->num_rows();
    }
  } else {
   for (int col = 0; col < cols; ++col)
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason why the special case of column 0 is processed here is that if we want to implement count(*), then this RecordBatch does not store data, that is, only the length. This scenario is triggered when small batches need to be merged.

Copy link
Collaborator

@js8544 js8544 Oct 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will the following change make any difference on the result?

Result<std::shared_ptr<RecordBatch>> ConcatenateRecordBatches(
    const RecordBatchVector& batches, MemoryPool* pool) {
  int64_t length = 0;
  size_t n = batches.size();
  if (n == 0) {
    return Status::Invalid("Must pass at least one recordbatch");
  }
  int cols = batches[0]->num_columns();
  auto schema = batches[0]->schema();
  for (int i = 0; i < batches.size(); ++i) {
    length += batches[i]->num_rows();
    if (!schema->Equals(batches[i]->schema())) {
      return Status::Invalid(
          "Schema of RecordBatch index ", i, " is ", batches[i]->schema()->ToString(),
          ", which does not match index 0 recordbatch schema: ", schema->ToString());
    }
  }

  std::vector<std::shared_ptr<Array>> concatenated_columns;
  for (int col = 0; col < cols; ++col) {
    ArrayVector column_arrays;
    for (const auto & batch : batches) {
      column_arrays.emplace_back(std::move(batch->column(col)));
    }
    ARROW_ASSIGN_OR_RAISE(auto concatenated_column, Concatenate(column_arrays, pool))
    concatenated_columns.emplace_back(std::move(concatenated_column));
  }
  return RecordBatch::Make(std::move(schema), length, std::move(concatenated_columns));
}

I assumed:

  1. Schema checking can be moved out of the nested loop.
  2. The result length is just the sum of all batch lengths, no matter how many (or zero) columns there are.
  3. Each column has the same length in a RecordBatch, so no checking of column length is needed here.
  4. Variable names like column_data, data, columns are ambiguous so I changed them to more descriptive names.
  5. More data can be std::moved to avoid copy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks very refreshing, but it does not handle the case of cols = 0. In this case, cols = 0, length is not obtained.

For a batch, there may be cols = 0, the schema is empty, but there is length. This kind of batch is used to process count(*) and does not store the actual columns, only the number of rows.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for (int i = 0; i < batches.size(); ++i) {
    length += batches[i]->num_rows();
    ...
}

Doesn't this include the cols=0 case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, you are correct, it is handled here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the modification. This code modification has been submitted.
note:

column_arrays.emplace_back(std::move(batch->column(col)));

Change to

column_arrays.emplace_back(batch->column(col));

cpp/src/arrow/record_batch_test.cc Outdated Show resolved Hide resolved
Comment on lines 449 to 451
if (cols == 0) {
// special case: null batch, no data, just length
for (size_t i = 0; i < batches.size(); ++i) {
length += batches[i]->num_rows();
}
} else {
Copy link
Collaborator

@js8544 js8544 Oct 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will the following change make any difference on the result?

Result<std::shared_ptr<RecordBatch>> ConcatenateRecordBatches(
    const RecordBatchVector& batches, MemoryPool* pool) {
  int64_t length = 0;
  size_t n = batches.size();
  if (n == 0) {
    return Status::Invalid("Must pass at least one recordbatch");
  }
  int cols = batches[0]->num_columns();
  auto schema = batches[0]->schema();
  for (int i = 0; i < batches.size(); ++i) {
    length += batches[i]->num_rows();
    if (!schema->Equals(batches[i]->schema())) {
      return Status::Invalid(
          "Schema of RecordBatch index ", i, " is ", batches[i]->schema()->ToString(),
          ", which does not match index 0 recordbatch schema: ", schema->ToString());
    }
  }

  std::vector<std::shared_ptr<Array>> concatenated_columns;
  for (int col = 0; col < cols; ++col) {
    ArrayVector column_arrays;
    for (const auto & batch : batches) {
      column_arrays.emplace_back(std::move(batch->column(col)));
    }
    ARROW_ASSIGN_OR_RAISE(auto concatenated_column, Concatenate(column_arrays, pool))
    concatenated_columns.emplace_back(std::move(concatenated_column));
  }
  return RecordBatch::Make(std::move(schema), length, std::move(concatenated_columns));
}

I assumed:

  1. Schema checking can be moved out of the nested loop.
  2. The result length is just the sum of all batch lengths, no matter how many (or zero) columns there are.
  3. Each column has the same length in a RecordBatch, so no checking of column length is needed here.
  4. Variable names like column_data, data, columns are ambiguous so I changed them to more descriptive names.
  5. More data can be std::moved to avoid copy.

cpp/src/arrow/record_batch_test.cc Show resolved Hide resolved
Copy link
Collaborator

@js8544 js8544 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. Thanks! But it will need @bkietz's approval before merging.

@js8544
Copy link
Collaborator

js8544 commented Oct 13, 2023

I think the C++ Windows Checks are failing due to this change but I couldn't figure out why...

@Light-City
Copy link
Contributor Author

I think the C++ Windows Checks are failing due to this change but I couldn't figure out why...

+1, I am also very confused, mac and linux are compiled correctly

But Windows is


[345/901] Linking CXX executable release\arrow-table-test.exe
FAILED: release/arrow-table-test.exe 
cmd.exe /C "cd . && D:\a\_temp\msys64\clang64\bin\c++.exe -Qunused-arguments -fcolor-diagnostics  -Wa,-mbig-obj -Wall -Wextra -Wdocumentation -Wshorten-64-to-32 -Wno-missing-braces -Wno-unused-parameter -Wno-constant-logical-operand -Wno-return-stack-address -Wdate-time -Wno-unknown-warning-option -Wno-pass-failed -mxsave -msse4.2  -O3 -DNDEBUG -O2  @CMakeFiles\arrow-table-test.rsp -o release\arrow-table-test.exe -Wl,--out-implib,release\libarrow-table-test.dll.a -Wl,--major-image-version,0,--minor-image-version,0  && cd ."
ld.lld: error: undefined symbol: arrow::ConcatenateRecordBatches(std::__1::vector<std::__1::shared_ptr<arrow::RecordBatch>, std::__1::allocator<std::__1::shared_ptr<arrow::RecordBatch>>> const&, arrow::MemoryPool*)
>>> referenced by src/arrow/CMakeFiles/arrow-table-test.dir/record_batch_test.cc.obj:(arrow::TestRecordBatch_ConcatenateRecordBatches_Test::TestBody())
>>> referenced by src/arrow/CMakeFiles/arrow-table-test.dir/record_batch_test.cc.obj:(arrow::TestRecordBatch_ConcatenateRecordBatches_Test::TestBody())

@Light-City
Copy link
Contributor Author

I think the C++ Windows Checks are failing due to this change but I couldn't figure out why...

Found that ARROW_EXPORT is missing, this should be the reason

cpp/src/arrow/record_batch.cc Show resolved Hide resolved
cpp/src/arrow/record_batch.cc Show resolved Hide resolved
cpp/src/arrow/record_batch_test.cc Show resolved Hide resolved
@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels Oct 13, 2023
@bkietz
Copy link
Member

bkietz commented Oct 13, 2023

CI failure seems unrelated
+1

@bkietz bkietz merged commit c5bce96 into apache:main Oct 13, 2023
33 of 35 checks passed
@bkietz bkietz removed the awaiting merge Awaiting merge label Oct 13, 2023
@conbench-apache-arrow
Copy link

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit c5bce96.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 8 possible false positives for unstable benchmarks that are known to sometimes produce them.

JerAguilon pushed a commit to JerAguilon/arrow that referenced this pull request Oct 23, 2023
…ache#37896)

### Rationale for this change

User scenario: When we use acero plan, many smaller batches may be generated through agg and hashjoin. In addition, due to the mpp database, there is data distribution. When there are many segments, each segment data is compared at this time. Small, in order to improve performance, we hope to merge multiple fragmented small batches into one large batch for calculation together.

### What changes are included in this PR?

record_batch.cc
record_batch.h
record_batch_test.cc

### Are these changes tested?

yes, see record_batch_test.cc

### Are there any user-facing changes?

yes

* Closes: apache#37895

Authored-by: light-city <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…ache#37896)

### Rationale for this change

User scenario: When we use acero plan, many smaller batches may be generated through agg and hashjoin. In addition, due to the mpp database, there is data distribution. When there are many segments, each segment data is compared at this time. Small, in order to improve performance, we hope to merge multiple fragmented small batches into one large batch for calculation together.

### What changes are included in this PR?

record_batch.cc
record_batch.h
record_batch_test.cc

### Are these changes tested?

yes, see record_batch_test.cc

### Are there any user-facing changes?

yes

* Closes: apache#37895

Authored-by: light-city <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…ache#37896)

### Rationale for this change

User scenario: When we use acero plan, many smaller batches may be generated through agg and hashjoin. In addition, due to the mpp database, there is data distribution. When there are many segments, each segment data is compared at this time. Small, in order to improve performance, we hope to merge multiple fragmented small batches into one large batch for calculation together.

### What changes are included in this PR?

record_batch.cc
record_batch.h
record_batch_test.cc

### Are these changes tested?

yes, see record_batch_test.cc

### Are there any user-facing changes?

yes

* Closes: apache#37895

Authored-by: light-city <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++] support concatenate recordbatches.
4 participants