
ARROW-17642: [C++] Add ordered aggregation #14352

Closed · wanted to merge 14 commits

Conversation


@rtpsw (author) commented Oct 8, 2022:

This PR implements a generalization of ordered aggregation called segmented aggregation, in which the segment-keys split the input rows into segments of consecutive rows that share a common value for those keys. In other words, the segment-keys are not required to be globally ordered, only to change value from one segment to the next. I opted for this generalization because the extra implementation effort is quite small and the result is useful.
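
To illustrate the idea (a standalone sketch, not code from this PR), a segmenter over a single integer key column only needs to find the offsets at which consecutive values change:

```cpp
// Standalone illustration of segment detection: return the start offset of
// each run of consecutive equal key values. The key values need not be
// sorted; they only need to be constant within each run.
#include <cstdint>
#include <vector>

std::vector<int64_t> SegmentBoundaries(const std::vector<int64_t>& keys) {
  std::vector<int64_t> starts;
  for (int64_t i = 0; i < static_cast<int64_t>(keys.size()); ++i) {
    if (i == 0 || keys[i] != keys[i - 1]) {
      starts.push_back(i);  // a new segment begins here
    }
  }
  return starts;
}
```

For example, keys [0, 0, 0, 1, 2, 2] yield starts {0, 3, 4}: three segments, even though nothing requires the values to appear in sorted order.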

@rtpsw (author) commented Oct 8, 2022:

Note that the current implementation does not support streaming: the segmented-aggregation operation produces its entire output only after processing the entire input. However, the current implementation gets close enough that a future PR could add streaming support without too much effort.

cc @icexelloss

@rtpsw (author) commented Oct 8, 2022:

Note that this PR includes code from this and this PR, which are pending at this time.

@icexelloss (Contributor):

Thanks @rtpsw! Will try to take a look.

@rtpsw (author) commented Oct 15, 2022:

@lidavidm, are you a good person to review this? Or can you suggest someone? Also, note that this PR includes a new single-writer/multiple-reader (SWMR) gadget which may be of wider interest and could be promoted to a visible internal API. WDYT?

@lidavidm (Member):

Weston would probably be best, but he's going to be busy with release things. Do you need a review soon?

The link to the gadget is dead, so I'm not sure what you're referencing.

@rtpsw (author) commented Oct 15, 2022:

> Weston would probably be best, but he's going to be busy with release things. Do you need a review soon?

Personally, I can hold. @icexelloss should answer if he needs this sooner.

> The link to the gadget is dead, so I'm not sure what you're referencing.

Sorry about that. Please see `GatedSharedMutex` and related code in `cpp/src/arrow/compute/exec/aggregate_node.cc`.

@rtpsw (author) commented Oct 15, 2022:

> The link to the gadget is dead, so I'm not sure what you're referencing.

> Sorry about that. Please see `GatedSharedMutex` and related code in `cpp/src/arrow/compute/exec/aggregate_node.cc`.

Turns out the link is OK, but GitHub doesn't scroll it into view when the containing file has a large diff. When I click "Load diff" for this file, GitHub highlights the correct block of code for the link. I'm not sure whether this is a bug or a feature of GitHub.

@rtpsw (author) commented Oct 15, 2022:

Notes about the recent commit:

  • The PyArrow GroupBy API currently does not expose the `segment_keys` parameter. I think this should be done in a separate task.
  • The C++ GroupBy API currently does not support streaming. While the `Grouper` class it relies on has been refactored to make it easy to add streaming support, I think this is not a priority for the current PR, because the important API is the aggregate node.
  • The aggregate node API now supports streaming. However, the existing tests only cover the case of an empty `segment_keys` parameter. These tests do cover the aggregate node's code in the PR, but not that of all `GroupingSegment` implementations. I intend to add more tests here.
  • The `SegmentKeyWithChunkedArray` test only covers chunked arrays and a single-key segmenter. I intend to add tests for other datum kinds and segmenter types.

It would be good to get feedback on the approach for this PR even before I add further tests.

@icexelloss (Contributor):

@lidavidm We do not need this soon. This can wait a little bit (2 weeks or so)

@rtpsw (author) commented Oct 17, 2022:

Pushed a change I had in flight. Holding for now.

@rtpsw (author) commented Oct 23, 2022:

> @lidavidm, are you a good person to review this? Or can you suggest someone? Also, note that this PR includes a new single-writer/multiple-reader (SWMR) gadget which may be of wider interest and could be promoted to a visible internal API. WDYT?

To be fair, there are plenty of possible mutex policies (example discussion). I think the first question is whether the proposed policy and its implementation are useful enough to Arrow to be worth a (separate) discussion.

@rtpsw (author) commented Nov 6, 2022:

@lidavidm, it may be a good time to get back to this. cc @icexelloss

@lidavidm (Member) commented Nov 7, 2022:

Yes, sorry. I'll see if I can get to this soon. Or CC @westonpace.

```cpp
  return -1;
}

bool HasConsistentChunks(const ExecBatch& batch, int index_of_chunk) {
```
A reviewer (Member) commented:

Not necessarily an issue here, but maybe something like `MultipleChunkIterator` would let us avoid having to do this check?

```cpp
/// \brief EXPERIMENTAL: Utility for incremental iteration over contiguous
/// pieces of potentially differently-chunked ChunkedArray objects
class ARROW_EXPORT MultipleChunkIterator {
 public:
  MultipleChunkIterator(const ChunkedArray& left, const ChunkedArray& right)
      : left_(left),
        right_(right),
        pos_(0),
        length_(left.length()),
        chunk_idx_left_(0),
        chunk_idx_right_(0),
        chunk_pos_left_(0),
        chunk_pos_right_(0) {}

  bool Next(std::shared_ptr<Array>* next_left, std::shared_ptr<Array>* next_right);

  int64_t position() const { return pos_; }

 private:
  const ChunkedArray& left_;
  const ChunkedArray& right_;
  // The amount of the entire ChunkedArray consumed
  int64_t pos_;
  // Length of the chunked array(s)
  int64_t length_;
  // Current left chunk
  int chunk_idx_left_;
  // Current right chunk
  int chunk_idx_right_;
  // Offset into the current left chunk
  int64_t chunk_pos_left_;
  // Offset into the current right chunk
  int64_t chunk_pos_right_;
};
```

@rtpsw (author) replied:

The reason for requiring consistent chunks (defined as having the same sequence of chunk lengths for each chunked-array value in the batch) is that it ensures we can create an ExecSpan without fragmenting the chunks. For example, if one value had chunk lengths [3, 5] and a second value had chunk lengths [2, 6], we'd need to fragment into three ExecSpans of lengths [2, 1, 5], since each span must cover rows from a single chunk of every value. The fragmentation could easily get much worse with many values having inconsistent chunks, and its performance cost would be annoying. So, while the requirement can be lifted using something like MultipleChunkIterator, I'm not sure it is worth it. Let me know your thoughts.
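
To make the arithmetic concrete, here is a standalone sketch (not PR code) that computes the span lengths forced by two inconsistent chunk layouts:

```cpp
// Given the chunk lengths of two chunked-array values, compute the lengths of
// the spans needed so that every span stays within one chunk of both values.
#include <algorithm>
#include <cstdint>
#include <vector>

std::vector<int64_t> SpanLengths(std::vector<int64_t> a, std::vector<int64_t> b) {
  std::vector<int64_t> spans;
  size_t i = 0, j = 0;
  while (i < a.size() && j < b.size()) {
    int64_t step = std::min(a[i], b[j]);  // advance to the nearer chunk edge
    spans.push_back(step);
    if ((a[i] -= step) == 0) ++i;
    if ((b[j] -= step) == 0) ++j;
  }
  return spans;
}
```

SpanLengths({3, 5}, {2, 6}) returns {2, 1, 5}, matching the example above; with many values and inconsistent chunking the fragmentation compounds.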

@westonpace (Member) left a comment:

I've only managed a cursory glance so far; I'd like to take a closer look and will try to get a chance next week. At a high level, I have a few worries:

  • It appears you have made some changes to allow exec batches to contain chunked arrays. However, I do not think an exec batch should ever need to contain a chunked array. If a grouping node wants to deal with some chunked arrays internally, I think that is OK, but it shouldn't put that complexity on the rest of the nodes.
  • I don't think we really want to maintain the `GroupBy` methods in `aggregate.h`. They do things slightly differently than an ExecPlan, and having two subtly different implementations is going to be a headache. I'd rather convert the Python bindings to use an ExecPlan.

```cpp
}

const char* kind_name() const override { return "ScalarAggregateNode"; }

Status DoConsume(const ExecSpan& batch, size_t thread_index) {
  GatedSharedLock lock(gated_shared_mutex_);
```
A reviewer (Member) commented:

Why is this lock needed? Each batch of incoming data should be more or less separately processable, right? Or is this to handle the case where a segment spreads across multiple batches? Ideally we should not be holding any locks when we make kernel function calls.

@rtpsw (author) replied:

Without these locks, I observed a race condition between the finishing of one segment, which reads the accumulated states, and the consuming of the next segment, which accumulates states. The GatedSharedLock in the code ensures that the (multi-threaded) consuming of the next segment does not start until the finishing of the current segment completes. I can think of ways to reduce the locking, but they would be more complex and are better left for a future task.
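
For illustration only, a minimal sketch of such a gate built on std::shared_mutex; the PR's actual GatedSharedMutex is not shown here, so treat this as an assumption about the policy described above:

```cpp
// Consuming threads take the gate shared, so they run concurrently; finishing
// a segment takes it exclusively, so no consumer of the next segment starts
// until the current segment's accumulated state has been read out.
#include <mutex>
#include <shared_mutex>

class SegmentGate {
 public:
  // Held for the duration of a (possibly concurrent) consume call.
  std::shared_lock<std::shared_mutex> Consume() {
    return std::shared_lock<std::shared_mutex>(mutex_);
  }
  // Held while finishing a segment; excludes all consumers.
  std::unique_lock<std::shared_mutex> Finish() {
    return std::unique_lock<std::shared_mutex>(mutex_);
  }

 private:
  std::shared_mutex mutex_;
};
```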

@lidavidm (Member):

There's some discussion here, I should have pinged you at the time: https://issues.apache.org/jira/browse/ARROW-17965

@rtpsw (author) commented Nov 10, 2022:

Thanks for all the comments. I'll try to post some quick responses, but I'll need to get back to this later due to other priorities.

@westonpace (Member):

> There's some discussion here, I should have pinged you at the time: https://issues.apache.org/jira/browse/ARROW-17965

No worries! I've been pretty busy these last few weeks so I don't know that I would have caught it anyways.

@rtpsw (author) commented Nov 20, 2022:

This CI job failure may be related to the recent commit, though I can't think of which change could have caused it. How do I set up an environment to try to reproduce it? Or is there a way to grab the core dump or raw log file?

@rtpsw (author) commented Nov 27, 2022:

> I'm still pretty reluctant to add code to handle chunked arrays. I feel it adds complexity that we will end up maintaining, when chunked arrays don't really have a place in a streaming execution engine (since we usually process things one batch at a time).

This is understandable. I'll try to drop support for chunked arrays in this PR and report back on what seems to break; we may be able to find an alternative approach.

My investigation suggests that handling of chunked arrays was introduced in the first place because the testers use tables, and their implementing class `SimpleTable` has `ChunkedArray` columns (even after `CombineChunks`) that the aggregation code needs to handle. Therefore, if we remove support for chunked arrays from the aggregation code, it won't work nicely with table inputs. AFAIU, aggregating tables is a valid use case that should be supported. @westonpace, let me know your thoughts.
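
As background, a minimal illustration of why table inputs surface chunked arrays; the `Inspect` helper below is hypothetical, but the column typing is how Arrow C++ tables behave:

```cpp
// Arrow C++ tables always expose columns as ChunkedArray, even when a column
// holds exactly one chunk, so table-fed aggregation receives ChunkedArray
// datums regardless of CombineChunks.
#include <arrow/result.h>
#include <arrow/table.h>

arrow::Status Inspect(const std::shared_ptr<arrow::Table>& table) {
  ARROW_ASSIGN_OR_RAISE(auto combined, table->CombineChunks());
  // Still a ChunkedArray, now with num_chunks() == 1.
  std::shared_ptr<arrow::ChunkedArray> column = combined->column(0);
  return column->num_chunks() == 1 ? arrow::Status::OK()
                                   : arrow::Status::Invalid("unexpected chunking");
}
```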

@icexelloss (Contributor) commented Nov 27, 2022, via email.

@rtpsw (author) commented Nov 27, 2022:

> Does the non-ordered / hash aggregation handle chunked arrays?

AFAICS, yes. There are plenty of pre-PR GroupBy tests in `hash_aggregate_test.cc`, such as `CountOnly` and `SumOnly`, that create a table as input for (pre-PR, unordered) aggregation using `GroupBy` (or `AlternatorGroupBy` as of the recent commit in this PR).

@rtpsw (author) commented Nov 27, 2022:

> This CI job failure may be related to the recent commit, though I can't think of which change could have caused it. How do I set up an environment to try to reproduce it? Or is there a way to grab the core dump or raw log file?

I see the same failure for the latest commit, so same questions.

@rtpsw (author) commented Nov 28, 2022:

> This CI job failure may be related to the recent commit, though I can't think of which change could have caused it. How do I set up an environment to try to reproduce it? Or is there a way to grab the core dump or raw log file?

> I see the same failure for the latest commit, so same questions.

I followed up on the job's workflow and found the command `archery docker run java-jni-manylinux-2014`, but on my machine it runs into errors I don't know how to deal with:

```
$ archery docker run java-jni-manylinux-2014
Traceback (most recent call last):
  File "/mnt/venv/bin/archery", line 33, in <module>
    sys.exit(load_entry_point('archery', 'console_scripts', 'archery')())
  File "/mnt/venv/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/mnt/venv/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/mnt/venv/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/mnt/venv/lib/python3.10/site-packages/click/core.py", line 1654, in invoke
    super().invoke(ctx)
  File "/mnt/venv/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/mnt/venv/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/mnt/venv/lib/python3.10/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/mnt/github/rtpsw/arrow/dev/archery/archery/docker/cli.py", line 67, in docker
    compose = DockerCompose(config_path, params=os.environ)
  File "/mnt/github/rtpsw/arrow/dev/archery/archery/docker/core.py", line 169, in __init__
    self.config = ComposeConfig(config_path, dotenv_path, compose_bin,
  File "/mnt/github/rtpsw/arrow/dev/archery/archery/docker/core.py", line 68, in __init__
    self._read_config(config_path, compose_bin)
  File "/mnt/github/rtpsw/arrow/dev/archery/archery/docker/core.py", line 136, in _read_config
    raise ValueError(
ValueError: Found errors with docker-compose:
 - /snap/docker/2285/lib/python3.6/site-packages/paramiko/transport.py:33: CryptographyDeprecationWarning: Python 3.6 is no longer supported by the Python core team. Therefore, support for it is deprecated in cryptography and will be removed in a future release.
 -   from cryptography.hazmat.backends import default_backend
 - Traceback (most recent call last):
 -   File "/snap/docker/2285/bin/docker-compose", line 33, in <module>
 -     sys.exit(load_entry_point('docker-compose==1.29.2', 'console_scripts', 'docker-compose')())
 -   File "/snap/docker/2285/lib/python3.6/site-packages/compose/cli/main.py", line 81, in main
 -     command_func()
 -   File "/snap/docker/2285/lib/python3.6/site-packages/compose/cli/main.py", line 197, in perform_command
 -     handler(command, command_options)
 -   File "/snap/docker/2285/lib/python3.6/site-packages/compose/metrics/decorator.py", line 18, in wrapper
 -     result = fn(*args, **kwargs)
 -   File "/snap/docker/2285/lib/python3.6/site-packages/compose/cli/main.py", line 404, in config
 -     compose_config = get_config_from_options('.', self.toplevel_options, additional_options)
 -   File "/snap/docker/2285/lib/python3.6/site-packages/compose/cli/command.py", line 104, in get_config_from_options
 -     environment = Environment.from_env_file(override_dir or base_dir, environment_file)
 -   File "/snap/docker/2285/lib/python3.6/site-packages/compose/config/environment.py", line 67, in from_env_file
 -     instance = _initialize()
 -   File "/snap/docker/2285/lib/python3.6/site-packages/compose/config/environment.py", line 62, in _initialize
 -     return cls(env_vars_from_file(env_file_path))
 -   File "/snap/docker/2285/lib/python3.6/site-packages/compose/config/environment.py", line 38, in env_vars_from_file
 -     env = dotenv.dotenv_values(dotenv_path=filename, encoding='utf-8-sig', interpolate=interpolate)
 -   File "/snap/docker/2285/lib/python3.6/site-packages/dotenv/main.py", line 372, in dotenv_values
 -     encoding=encoding,
 -   File "/snap/docker/2285/lib/python3.6/site-packages/dotenv/main.py", line 74, in dict
 -     self._dict = OrderedDict(resolve_variables(raw_values, override=self.override))
 -   File "/snap/docker/2285/lib/python3.6/site-packages/dotenv/main.py", line 231, in resolve_variables
 -     for (name, value) in values:
 -   File "/snap/docker/2285/lib/python3.6/site-packages/dotenv/main.py", line 81, in parse
 -     with self._get_stream() as stream:
 -   File "/snap/docker/2285/usr/lib/python3.6/contextlib.py", line 81, in __enter__
 -     return next(self.gen)
 -   File "/snap/docker/2285/lib/python3.6/site-packages/dotenv/main.py", line 54, in _get_stream
 -     with io.open(self.dotenv_path, encoding=self.encoding) as stream:
 - PermissionError: [Errno 13] Permission denied: '/mnt/github/rtpsw/arrow/.env'
```

@westonpace (Member):

> AFAIU, aggregating tables is a valid use case that should be supported. @westonpace, let me know your thoughts.

Aggregation of tables should be done by using a table source node and an exec plan, not by using the `GroupBy` function in `aggregate.h`. That was the intent behind the comment:

```cpp
/// Internal use only: helpers for PyArrow and testing HashAggregateKernels.
/// For public use see arrow::compute::Grouper or create an execution plan
/// and use an aggregate node.
```
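
For reference, a hedged sketch of that route. It uses the present-day arrow::acero spellings (Declaration, TableSourceNodeOptions, AggregateNodeOptions, DeclarationToTable); in the 2022 tree these lived under arrow::compute, and the column names `x` and `key` are hypothetical:

```cpp
// Aggregate a Table by feeding it through a table_source node and an
// aggregate node in an exec plan, instead of the internal GroupBy helper.
#include <arrow/acero/exec_plan.h>
#include <arrow/acero/options.h>
#include <arrow/table.h>

arrow::Result<std::shared_ptr<arrow::Table>> SumByKey(
    std::shared_ptr<arrow::Table> table) {
  namespace ac = arrow::acero;
  ac::Declaration plan = ac::Declaration::Sequence({
      {"table_source", ac::TableSourceNodeOptions(std::move(table))},
      {"aggregate",
       ac::AggregateNodeOptions(/*aggregates=*/{{"hash_sum", nullptr, "x", "sum(x)"}},
                                /*keys=*/{"key"})},
  });
  return ac::DeclarationToTable(std::move(plan));
}
```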

@rtpsw (author) commented Dec 6, 2022:

> > AFAIU, aggregating tables is a valid use case that should be supported. @westonpace, let me know your thoughts.
>
> Aggregation of tables should be done by using a table source node and an exec plan, not by using the `GroupBy` function in `aggregate.h`. That was the intent behind the comment:
>
> ```cpp
> /// Internal use only: helpers for PyArrow and testing HashAggregateKernels.
> /// For public use see arrow::compute::Grouper or create an execution plan
> /// and use an aggregate node.
> ```

IIUC, support for chunked arrays is needed only for `GroupBy` as a testing facility (and this is the case for hash aggregation too), right? So, if we remove this support, some current testing using `GroupBy` would break and would need to be removed too. @westonpace, are you fine with this outcome?

@westonpace (Member):

> IIUC, support for chunked arrays is needed only for `GroupBy` as a testing facility (and this is the case for hash aggregation too), right? So, if we remove this support, some current testing using `GroupBy` would break and would need to be removed too. @westonpace, are you fine with this outcome?

I have created #14867 which removes the internal GroupBy facility entirely (well, remaps it onto exec plans).

@rtpsw (author) commented Dec 7, 2022:

> I have created #14867 which removes the internal GroupBy facility entirely (well, remaps it onto exec plans).

OK. Should we hold here until #14867 is merged, and then rebase?

@rtpsw (author) commented Dec 14, 2022:

> I have created #14867 which removes the internal GroupBy facility entirely (well, remaps it onto exec plans).

@westonpace, I'm confused about the next step here. Should this PR hold until yours is merged, and then merge it in?

@icexelloss (Contributor) commented Dec 14, 2022:

I would like to take a deeper look at this before merging (sorry, haven't been able to do it yet).

@westonpace (Member):

> Should this PR hold until yours is merged, and then merge it in?

Unfortunately, I think that would be the easiest path. Otherwise we risk introducing a bunch of complexity that we don't need and just have to pull out later.

@westonpace (Member) left a comment:

I don't think you're testing a group-by with the exec plan using segment keys, but I might be missing something. So I don't think the changes in `aggregate_node.cc` are tested.

```diff
@@ -180,7 +180,7 @@ struct ARROW_EXPORT ExecBatch {

   explicit ExecBatch(const RecordBatch& batch);

-  static Result<ExecBatch> Make(std::vector<Datum> values);
+  static Result<ExecBatch> Make(std::vector<Datum> values, int64_t length = -1);
```
A reviewer (Member) commented:

Naively, one looking at this might be confused about why you now need both `ExecBatch::Make` and the `ExecBatch` constructor, since both take a vector of values and a length.

Looking closer, it seems the `Make` function does the extra work of verifying that the lengths of the datums match the given length.

Could you add some comments explaining this for future readers?
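
One possible comment addressing this ask; the described validation behavior is inferred from the discussion above, so treat the wording as a sketch:

```cpp
/// \brief Create an ExecBatch from values, validating that every value is a
/// scalar or an array-like whose length is consistent with `length`
/// (inferred from the values when `length` is -1).
///
/// Unlike the ExecBatch constructor, which trusts its arguments, Make
/// returns an error Status when the value lengths disagree.
static Result<ExecBatch> Make(std::vector<Datum> values, int64_t length = -1);
```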

Comment on lines 237 to 243:

```cpp
std::vector<Datum> selected_values(ids.size());
for (size_t i = 0; i < ids.size(); i++) {
  if (ids[i] < 0 || static_cast<size_t>(ids[i]) >= values.size()) {
    return Status::Invalid("ExecBatch invalid value selection: ", ids[i]);
  }
  selected_values[i] = values[ids[i]];
}
```

A reviewer (Member) suggested changing this to:

```cpp
std::vector<Datum> selected_values;
selected_values.reserve(ids.size());
for (int idx : ids) {
  if (idx < 0 || idx >= static_cast<int>(values.size())) {
    return Status::Invalid("ExecBatch invalid value selection: ", idx);
  }
  selected_values.push_back(values[idx]);
}
```

Minor nit: this seems slightly more readable with a for-each loop.

```diff
@@ -233,6 +233,17 @@ struct ARROW_EXPORT ExecBatch {

   ExecBatch Slice(int64_t offset, int64_t length) const;

+  Result<ExecBatch> SelectValues(const std::vector<int>& ids) const {
```

A reviewer (Member) commented:

I know there aren't many comment blocks in this file, but `ExecBatch` is less experimental than it used to be. Can you comment this method?

Also, is there any particular reason to have the implementation in the header file?

Comment on lines +82 to +83:

```cpp
/// Currently only uint32 indices will be produced, eventually the bit width will only
/// be as wide as necessary.
```

A reviewer (Member) suggested deleting these lines:

We can save this for a future PR or put it into a TODO, but speculation about future issues should be left out of /// comment blocks.

@rtpsw (author) replied:

This is copied from an existing comment on a method of the same name. I don't mind removing it if you still think it should be removed.

```cpp
@@ -44,7 +45,405 @@ namespace compute {

namespace {

struct GrouperImpl : Grouper {
inline const uint8_t* GetValuesAsBytes(const ArrayData& data, int64_t offset = 0) {
  int64_t absolute_byte_offset = (data.offset + offset) * data.type->byte_width();
```

A reviewer (Member) commented:

Maybe a `DCHECK` here to ensure that `data.type->byte_width() > 0` (e.g. to ensure that this is a fixed-width data type).
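
A sketch of the suggested guard, assuming Arrow's `DCHECK_GT` macro from `arrow/util/logging.h` (the rest of the function body is elided):

```cpp
inline const uint8_t* GetValuesAsBytes(const ArrayData& data, int64_t offset = 0) {
  // Fixed-width types only: byte_width() is not positive for variable-width types.
  DCHECK_GT(data.type->byte_width(), 0);
  int64_t absolute_byte_offset = (data.offset + offset) * data.type->byte_width();
  // ... rest of the function as in the PR ...
}
```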

```cpp
base += keys.size();
for (size_t i = 0; i < segment_keys.size(); ++i) {
  int segment_key_field_id = segment_key_field_ids[i];
  output_fields[base + i] = input_schema->field(segment_key_field_id);
```

A reviewer (Member) commented:

I don't think `output_fields` is large enough to do `output_fields[base + i]`; I get a segmentation fault here.

Are the segment keys part of the output? I would expect them to be. However, I also don't see any place in `GroupByNode::Finalize` that puts the segment keys into the output data.

@rtpsw (author) commented Dec 23, 2022:

> I don't think you're testing a group-by with the exec plan using segment keys, but I might be missing something. So I don't think the changes in `aggregate_node.cc` are tested.

This looks like an oversight. Sorry about that. I'll work on this.

@rtpsw (author) commented Mar 3, 2023:

Replaced by #34311

@rtpsw closed this on Mar 3, 2023.

@icexelloss added a commit that referenced this pull request on Mar 10, 2023:
This PR adds "Segmented Aggregation" to the existing aggregation
node to improve aggregation on ordered data.

A segment group is defined as "a continuous chunk of data that has the
same segment key value". E.g., if the input data looks like

```
[0, 0, 0, 1, 2, 2] 
```

Then there are three segments: `[0, 0, 0]`, `[1]`, and `[2, 2]`.

(Note: the "group" in "segment group" here is added to differentiate it from
"segment", which is defined as "a continuous chunk of data within an
ExecBatch".)

Segmented aggregation can be used to replace the existing hash aggregation in
the case that the data are ordered. The benefits are:
(1) We can output aggregation results earlier (as soon as a segment group
is fully consumed).
(2) We only need to hold the partial aggregation state for one segment group,
reducing memory usage.

See https://issues.apache.org/jira/browse/ARROW-17642

Replaces #14352
* Closes: #32884

Follow ups
=======
* #34475 
* #34529

---------

Co-authored-by: Li Jin <[email protected]>
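
For readers of the merged feature: the present-day arrow::acero::AggregateNodeOptions accepts segment_keys alongside keys, so an ordered column can be declared as a segment key. A hedged sketch using current API spellings, with hypothetical column names `x`, `key`, and `time`:

```cpp
#include <arrow/acero/options.h>

// Declare a segmented aggregation: "time" must only change between segments
// (runs of consecutive rows), while "key" is hash-grouped within each segment.
arrow::acero::AggregateNodeOptions MakeSegmentedAggOptions() {
  return arrow::acero::AggregateNodeOptions(
      /*aggregates=*/{{"hash_sum", nullptr, "x", "sum(x)"}},
      /*keys=*/{"key"},
      /*segment_keys=*/{"time"});
}
```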
@rtpsw deleted the ARROW-17642 branch on April 23, 2023.