Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-17642: [C++] Add ordered aggregation #14352

Closed
wants to merge 14 commits into from
5 changes: 2 additions & 3 deletions cpp/src/arrow/compute/exec.cc
Original file line number Diff line number Diff line change
Expand Up @@ -140,12 +140,11 @@ ExecBatch ExecBatch::Slice(int64_t offset, int64_t length) const {
return out;
}

Result<ExecBatch> ExecBatch::Make(std::vector<Datum> values) {
if (values.empty()) {
Result<ExecBatch> ExecBatch::Make(std::vector<Datum> values, int64_t length) {
if (values.empty() && length < 0) {
return Status::Invalid("Cannot infer ExecBatch length without at least one value");
}

int64_t length = -1;
for (const auto& value : values) {
if (value.is_scalar()) {
continue;
Expand Down
13 changes: 12 additions & 1 deletion cpp/src/arrow/compute/exec.h
Original file line number Diff line number Diff line change
Expand Up @@ -180,7 +180,7 @@ struct ARROW_EXPORT ExecBatch {

explicit ExecBatch(const RecordBatch& batch);

static Result<ExecBatch> Make(std::vector<Datum> values);
static Result<ExecBatch> Make(std::vector<Datum> values, int64_t length = -1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naively one looking at this might be confused why you now need both ExecBatch::Make and ExecBatch::ExecBatch since both take a vector of values and a length.

Looking closer it seems the Make function does the extra work of verifying that the length of the datums match the given length.

Could you add some comments explaining this for future readers?


Result<std::shared_ptr<RecordBatch>> ToRecordBatch(
std::shared_ptr<Schema> schema, MemoryPool* pool = default_memory_pool()) const;
Expand Down Expand Up @@ -233,6 +233,17 @@ struct ARROW_EXPORT ExecBatch {

ExecBatch Slice(int64_t offset, int64_t length) const;

Result<ExecBatch> SelectValues(const std::vector<int>& ids) const {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know there aren't many comment blocks in this file but ExecBatch is less experimental than it used to be. Can you comment this method?

Also, is there any particular reason to have the implementation in the header file?

std::vector<Datum> selected_values(ids.size());
for (size_t i = 0; i < ids.size(); i++) {
if (ids[i] < 0 || static_cast<size_t>(ids[i]) >= values.size()) {
return Status::Invalid("ExecBatch invalid value selection: ", ids[i]);
}
selected_values[i] = values[ids[i]];
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
std::vector<Datum> selected_values(ids.size());
for (size_t i = 0; i < ids.size(); i++) {
if (ids[i] < 0 || static_cast<size_t>(ids[i]) >= values.size()) {
return Status::Invalid("ExecBatch invalid value selection: ", ids[i]);
}
selected_values[i] = values[ids[i]];
}
std::vector<Datum> selected_values;
selected_values.reserve(ids.size());
for (int idx : ids) {
if (idx < 0 || idx >= static_cast<int>(values.size())) {
return Status::Invalid("ExecBatch invalid value selection: ", idx);
}
selected_values.push_back(values[idx]);
}

Minor nit: this seems slightly more readable with a foreach loop.

return ExecBatch(std::move(selected_values), length);
}

/// \brief A convenience for returning the types from the batch.
std::vector<TypeHolder> GetTypes() const {
std::vector<TypeHolder> result;
Expand Down
Loading