ARROW-17642: [C++] Add ordered aggregation #14352
Conversation
This PR implements a generalization of ordered aggregation called segmented aggregation, in which the segment-keys split the input rows into segments of consecutive rows with common values for these segment-keys. In other words, the extra keys are not required to be ordered but only to change from one segment to the next. I opted to generalize this way because the extra implementation effort is quite small and the result is useful.
Note that the current implementation does not support streaming. In particular, the segment-aggregation operation produces the entire output after processing the entire input. However, the current implementation gets close enough to allow a future PR to add support for streaming without too much effort. cc @icexelloss
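To make the segment semantics above concrete, here is a minimal standalone C++ sketch (not part of the PR; the function name is illustrative) of how segment boundaries fall out of the definition:

```cpp
#include <cstdint>
#include <vector>

// Rows form one segment for as long as the segment-key value does not change;
// the values need not be globally sorted, only constant within each segment.
std::vector<int64_t> SegmentStarts(const std::vector<int64_t>& segment_key_values) {
  std::vector<int64_t> starts;
  for (int64_t i = 0; i < static_cast<int64_t>(segment_key_values.size()); ++i) {
    if (i == 0 || segment_key_values[i] != segment_key_values[i - 1]) {
      starts.push_back(i);  // a new segment begins where the key changes
    }
  }
  return starts;
}

// e.g. {0, 0, 0, 1, 2, 2} -> {0, 3, 4}; {2, 2, 0} is equally valid input,
// since the keys only need to change between segments, not be ordered.
```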
Thanks @rtpsw! Will try to take a look.
@lidavidm, are you a good person to review this? Or can you suggest someone? Also, note this PR includes a new SWMR (single-writer, multiple-reader) gadget which may be of wider interest and could be upgraded to a visible internal API. WDYT?
Weston would probably be best, but he's going to be busy with release things. Do you need a review soon? The link to the gadget is dead, so I'm not sure what you're referencing.
Personally, I can hold. @icexelloss should answer if he needs this sooner.
Sorry about that. Please see
Turns out the link is OK, but GitHub avoids scrolling it into view when the containing file has a large diff. When I click
Notes about the recent commit:
It would be good to get feedback on the approach for this PR even before I add further tests.
@lidavidm We do not need this soon. This can wait a little bit (2 weeks or so).
Pushed a change I had in flight. Holding for now.
To be fair, there are plenty of possible mutex policies (example discussion). I think the first question is whether the usefulness to Arrow of the proposed policy and its implementation are worth a (separate) discussion.
@lidavidm, it may be a good time to get back to this. cc @icexelloss
Yes, sorry. I'll see if I can get to this soon. Or CC @westonpace. |
```cpp
return -1;
}

bool HasConsistentChunks(const ExecBatch& batch, int index_of_chunk) {
```
Not necessarily an issue here, but maybe something like `MultipleChunkIterator` would let us avoid having to do this check?

arrow/cpp/src/arrow/chunked_array.h, lines 198 to 237 at 5889c78:
```cpp
/// \brief EXPERIMENTAL: Utility for incremental iteration over contiguous
/// pieces of potentially differently-chunked ChunkedArray objects
class ARROW_EXPORT MultipleChunkIterator {
 public:
  MultipleChunkIterator(const ChunkedArray& left, const ChunkedArray& right)
      : left_(left),
        right_(right),
        pos_(0),
        length_(left.length()),
        chunk_idx_left_(0),
        chunk_idx_right_(0),
        chunk_pos_left_(0),
        chunk_pos_right_(0) {}

  bool Next(std::shared_ptr<Array>* next_left, std::shared_ptr<Array>* next_right);

  int64_t position() const { return pos_; }

 private:
  const ChunkedArray& left_;
  const ChunkedArray& right_;
  // The amount of the entire ChunkedArray consumed
  int64_t pos_;
  // Length of the chunked array(s)
  int64_t length_;
  // Current left chunk
  int chunk_idx_left_;
  // Current right chunk
  int chunk_idx_right_;
  // Offset into the current left chunk
  int64_t chunk_pos_left_;
  // Offset into the current right chunk
  int64_t chunk_pos_right_;
};
```
The reason for requiring consistent chunks, which is defined as having the same sequence of chunk lengths for each (chunked-array) value in the batch, is that it ensures we can create an `ExecSpan` without fragmenting the chunks. For example, if we had one value with chunk lengths [3, 5] and a second value with chunk lengths [2, 6], then we'd need to fragment into 3 `ExecSpan`s of lengths [2, 1, 5], since each span must cover rows from a single chunk of every value. Of course, the fragmentation could easily get much worse with many values having inconsistent chunks, and the performance cost of fragmentation would be annoying. So, while the requirement could be lifted using something like `MultipleChunkIterator`, I'm not sure it is worth it. Let me know your thoughts.
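A small standalone sketch (not from the PR; the function name is illustrative) of the fragmentation arithmetic described above:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Given each value's chunk lengths, compute the span lengths needed so that
// every span lies within a single chunk of every value.
std::vector<int64_t> FragmentLengths(
    const std::vector<std::vector<int64_t>>& chunk_lengths) {
  std::vector<int64_t> ends;  // cumulative chunk end offsets across all values
  for (const auto& value : chunk_lengths) {
    int64_t end = 0;
    for (int64_t len : value) {
      end += len;
      ends.push_back(end);
    }
  }
  std::sort(ends.begin(), ends.end());
  ends.erase(std::unique(ends.begin(), ends.end()), ends.end());
  std::vector<int64_t> fragments;
  int64_t prev = 0;
  for (int64_t end : ends) {
    fragments.push_back(end - prev);  // each fragment runs between boundaries
    prev = end;
  }
  return fragments;
}

// FragmentLengths({{3, 5}, {2, 6}}) == {2, 1, 5}, matching the example above.
```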
I've only managed a cursory glance so far; I would like to take a closer look and will try to get a chance next week. At a high level I have a few worries:
- It appears you have made some changes to allow exec batches to contain chunked arrays. However, I do not think an exec batch should ever need to contain a chunked array. If a grouping node wants to deal with some chunked arrays internally I think that is OK, but it shouldn't put that complexity on the rest of the nodes.
- I don't think we really want to maintain the `GroupBy` methods in `aggregate.h`. They do things slightly differently than an ExecPlan, and having two subtly different implementations is going to be a headache. I'd rather convert the Python bindings to use an ExecPlan.
```cpp
}

const char* kind_name() const override { return "ScalarAggregateNode"; }

Status DoConsume(const ExecSpan& batch, size_t thread_index) {
  GatedSharedLock lock(gated_shared_mutex_);
```
Why is this lock needed? Each batch of incoming data should be more or less separately processable, right? Or is this to handle the case where a segment spreads out across multiple batches? Ideally we should not be holding any locks when we invoke kernel functions.
Without these locks, I observed a race condition between the finishing of one segment, which reads accumulated states, and the consuming of the next segment, which accumulates states. The `GatedSharedLock` in the code ensures that the (multi-threaded) consuming of the next segment does not run until the finishing of the current segment completes. I can think of ways to reduce the locking, but they would be more complex and are better left for a future task.
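`GatedSharedLock` itself is internal to this PR and not shown in this excerpt; for readers, a minimal sketch of the underlying reader/writer pattern using `std::shared_mutex` (the gadget's actual semantics may differ):

```cpp
#include <mutex>
#include <shared_mutex>

// Consuming batches of the current segment may run on many threads at once
// (shared access); finishing a segment must exclude all consumers so that it
// reads a stable accumulated state (exclusive access).
class SegmentedStates {
 public:
  void Consume(/* const ExecSpan& batch */) {
    std::shared_lock<std::shared_mutex> lock(mutex_);  // many concurrent consumers
    // ... accumulate kernel states for the current segment ...
  }

  void FinishSegment() {
    std::unique_lock<std::shared_mutex> lock(mutex_);  // single exclusive finisher
    // ... read accumulated states, emit output, reset for the next segment ...
  }

 private:
  std::shared_mutex mutex_;
};
```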
There's some discussion here; I should have pinged you at the time: https://issues.apache.org/jira/browse/ARROW-17965
Thanks for all the comments. I'll try to post some quick responses, but I'll need to get back to this later, due to other priorities.
No worries! I've been pretty busy these last few weeks, so I don't know that I would have caught it anyway.
This CI job failure may be related to the recent commit, though I can't think of which change could have caused it. How do I set up an environment to try to reproduce it? Or is there perhaps a way to grab the core dump or raw log file?
> I'm still pretty reluctant to add code to handle chunked arrays. I feel it adds complexity that we will end up maintaining when chunked arrays don't really have a place in a streaming execution engine (since we process things one batch at a time usually).

This is understandable. I'll try to drop support for chunked arrays in this PR and report back on what seems to break; we may be able to find an alternative approach.

My investigation suggests that the reason for introducing handling of chunked arrays in the first place is that the testers use tables, and their implementing class SimpleTable has ChunkedArray columns (even after CombineChunks) that the aggregation code needs to handle. Therefore, if we remove support for chunked arrays in the aggregation code, then it won't work nicely with table inputs. AFAIU, aggregating tables is a valid use case that should be supported. @westonpace, let me know your thoughts.
Does the non-ordered / hash aggregation handle chunked arrays?
AFAICS, yes. There are plenty of pre-PR
I see the same failure for the latest commit, so same questions.
I followed up on the job's workflow and found the command
Aggregation of tables should be done by using a table source node and an exec plan, not by using the `GroupBy` methods.
IIUC, support of chunked arrays is needed only for the internal `GroupBy` facility.
I have created #14867, which removes the internal GroupBy facility entirely (well, remaps it onto exec plans).
@westonpace, I'm confused about the next step here. Should this PR hold until yours is merged, and then merge it in?
I would like to take a deeper look at this before merging (sorry, I haven't been able to get to it yet).
Unfortunately, I think that would be the easiest path. Otherwise we risk introducing a bunch of complexity that we don't need and just have to pull out later.
I don't think you're testing a group-by with the exec plan using segment keys, but I might be missing something. So I don't think the changes in aggregate_node.cc are tested.
```diff
@@ -180,7 +180,7 @@ struct ARROW_EXPORT ExecBatch {

   explicit ExecBatch(const RecordBatch& batch);

-  static Result<ExecBatch> Make(std::vector<Datum> values);
+  static Result<ExecBatch> Make(std::vector<Datum> values, int64_t length = -1);
```
Naively, someone looking at this might be confused about why you now need both `ExecBatch::Make` and `ExecBatch::ExecBatch`, since both take a vector of values and a length. Looking closer, it seems the `Make` function does the extra work of verifying that the lengths of the datums match the given length. Could you add some comments explaining this for future readers?
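A sketch of the kind of comment being requested (the wording is illustrative, not from the PR):

```cpp
/// \brief Construct an ExecBatch from values, validating or inferring length.
///
/// Unlike the ExecBatch(std::vector<Datum>, int64_t) constructor, which
/// trusts its arguments, Make verifies that every array value matches the
/// given length (inferring the length from the values when length is -1)
/// and returns Status::Invalid on a mismatch.
static Result<ExecBatch> Make(std::vector<Datum> values, int64_t length = -1);
```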
cpp/src/arrow/compute/exec.h:
```cpp
std::vector<Datum> selected_values(ids.size());
for (size_t i = 0; i < ids.size(); i++) {
  if (ids[i] < 0 || static_cast<size_t>(ids[i]) >= values.size()) {
    return Status::Invalid("ExecBatch invalid value selection: ", ids[i]);
  }
  selected_values[i] = values[ids[i]];
}
```
Suggested change:

```cpp
std::vector<Datum> selected_values;
selected_values.reserve(ids.size());
for (int idx : ids) {
  if (idx < 0 || idx >= static_cast<int>(values.size())) {
    return Status::Invalid("ExecBatch invalid value selection: ", idx);
  }
  selected_values.push_back(values[idx]);
}
```
Minor nit: this seems slightly more readable with a range-based for loop.
cpp/src/arrow/compute/exec.h:
```diff
@@ -233,6 +233,17 @@ struct ARROW_EXPORT ExecBatch {

   ExecBatch Slice(int64_t offset, int64_t length) const;

+  Result<ExecBatch> SelectValues(const std::vector<int>& ids) const {
```
I know there aren't many comment blocks in this file, but `ExecBatch` is less experimental than it used to be. Can you comment this method? Also, is there any particular reason to have the implementation in the header file?
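For illustration, one way such a comment block might read (hypothetical wording, with the definition assumed moved out of the header):

```cpp
/// \brief Select a subset of this batch's values by position.
///
/// Returns a new ExecBatch of the same length holding values[ids[i]] for
/// each i, or Status::Invalid if any id is out of range. The definition
/// could then live in exec.cc rather than in this header.
Result<ExecBatch> SelectValues(const std::vector<int>& ids) const;
```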
```cpp
/// Currently only uint32 indices will be produced, eventually the bit width will only
/// be as wide as necessary.
```
Suggested change: remove both lines.
We can save this for a future PR or put it into a TODO, but speculation about future issues should be left out of `///` comment blocks.
This is copied from an existing comment on a method by the same name. I don't mind removing it if you still think it should be removed.
```diff
@@ -44,7 +45,405 @@ namespace compute {

 namespace {

 struct GrouperImpl : Grouper {
 inline const uint8_t* GetValuesAsBytes(const ArrayData& data, int64_t offset = 0) {
   int64_t absolute_byte_offset = (data.offset + offset) * data.type->byte_width();
```
Maybe a `DCHECK` here to ensure that `data.type->byte_width() > 0` (e.g. to ensure that this is a fixed-width data type).
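A sketch of how the suggested check might look; the function body beyond the two lines shown above (the return statement in particular) is an assumption, not the PR's code:

```cpp
inline const uint8_t* GetValuesAsBytes(const ArrayData& data, int64_t offset = 0) {
  // Guard: only fixed-width types have a positive byte width.
  DCHECK_GT(data.type->byte_width(), 0);
  int64_t absolute_byte_offset = (data.offset + offset) * data.type->byte_width();
  return data.GetValues<uint8_t>(1, absolute_byte_offset);
}
```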
```cpp
base += keys.size();
for (size_t i = 0; i < segment_keys.size(); ++i) {
  int segment_key_field_id = segment_key_field_ids[i];
  output_fields[base + i] = input_schema->field(segment_key_field_id);
```
I don't think `output_fields` is large enough to do `output_fields[base + i]`. I get a segmentation fault here.
Are the segment keys part of the output? I would expect them to be. However, I also don't see any place in GroupByNode::Finalize that is putting the segment keys into the output data.
This looks like an oversight. Sorry about that. I'll work on this.
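A hedged sketch of the kind of sizing fix implied; the names follow the snippet above, and the `aggregates` count is an assumption, not the PR's actual code:

```cpp
// Output layout: [aggregate results..., group keys..., segment keys...].
// output_fields must be sized for all three groups before writing at base + i.
std::vector<std::shared_ptr<arrow::Field>> output_fields(
    aggregates.size() + keys.size() + segment_keys.size());
```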
Replaced by #34311
This PR implements "Segmented Aggregation" to the existing aggregation node to improve aggregation on ordered data. A segment group is defined as "a continuous chunk of data that have the same segment key value. e.g, if the input data looks like ``` [0, 0, 0, 1, 2, 2] ``` Then there are three segments `[0, 0, 0]` `[1]` `[2, 2]` (Note the "group" in "segment group" here is added to differentiate from "segment", which is defined as "a continuous chunk of data with in a ExecBatch") Segment aggregation can be used to replace existing hash aggregation in the case that data are ordered. The benefit of this is (1) We can output aggregation result earlier (as soon as a segment group is fully consumed). (2) We only need to hold partial aggregation for one segment group to reduce memory usage. See https://issues.apache.org/jira/browse/ARROW-17642 Replaces #14352 * Closes: #32884 Follow ups ======= * #34475 * #34529 --------- Co-authored-by: Li Jin <[email protected]>