[C++][Python] A metadata standard for sorted datasets. #34451

coady · 2023-03-04T02:45:11Z

Describe the enhancement requested

Split off from #34153.

In order to take advantage of sorted columns, it would be necessary for arrow to standardize on a way to represent sorting in datasets/table metadata. Akin to to pandas' index_columns.

Component(s)

C++, Python

The text was updated successfully, but these errors were encountered:

westonpace · 2023-03-06T20:24:10Z

We're pretty close in Acero to the point where we can use this for real advantage internally. For example, there is a PR (#34311) up for a more RAM-efficient aggregation if we know the data is segmented / sorted. Currently, the node expects you to declare which columns are sorted ahead of time and, if they aren't, if will give you bad data.

However, if we had a metadata standard in place then pyarrow/R (for in-memory tables) and datasets (for on-disk tables) could automatically detect this condition and apply the more efficient aggregation. There are probably a few unknowns about how exactly that should happen (An optimization pass based on data statistics (e.g. orderedness)? Detected on the fly while running a plan?) but getting a standard in for representing this information would be a good first step.

drin · 2023-03-06T20:51:01Z

In some work I'm doing I plan on using metadata keys that are system-specific. It seems that part of the requirement here would be a way of "namespacing" metadata attributes for cooperative system design. I would also think that, at minimum, the following namespaces would always potentially co-exist:

acero (or arrow, generally)
substrait
"the system" (top-level system managing arrow data)
some application (whatever is interacting with the data stored in an arrow object)

With the following possible caveats:

(3) and (4) could be the same (maybe some large application such as duckdb that doesn't share the data with any other system).
(2) and (4) could be the same (maybe an application funnels all intent through a substrait producer).
If acero is never actively used, then conflicts with its namespace are moot (though prone to refactoring if acero is every adopted in the future)

Not sure if there has been any other proposal of metadata management in arrow that should be leveraged.

westonpace · 2023-03-06T23:00:48Z

Not sure if there has been any other proposal of metadata management in arrow that should be leveraged.

Sortedness metadata has been discussed on the ML before. The reception seemed generally favorable though no proposal was ever put forward. Similarly with min/max metadata.

In many systems I believe sortedness is often recorded outside of the files themselves as part of some catalog or table format (e.g. Iceberg)

As for namespacing, there is some precedent in the spec:

The colon symbol : is to be used as a namespace separator. It can be used multiple times in a key.

The ARROW pattern is a reserved namespace for internal Arrow use in the custom_metadata fields. For example, ARROW:extension:name.

I don't think there is a need at the moment for an Acero namespace as the only thing (so far) that Acero would be interested in would be sortedness and min/max statistics and/or index information. All of this should be universally applicable and ideally agreed upon across all implementations and not just in Acero (though we could start there while doing initial work with the hope of making a proposal for the wider community).

drin · 2023-03-06T23:30:43Z

Sortedness metadata has been discussed on the ML before

this is super helpful. This is the relevant PR for datafusion: apache/datafusion#1776. @alamb , if you have extra input it'd be nice to hear.

As for namespacing, there is some precedent in the spec

I think this addresses the aspect I was most concerned with, though namespace reservation hadn't totally occurred to me (awkward if multiple projects have conflicting namespaces). Probably in the short term this can be assumed to not be problematic (so acero can claim an arbitrary namespace when it's reasonable to do so).

In many systems I believe sortedness is often recorded outside of the files themselves as part of some catalog or table format

The interesting thing here seems to be that for substrait plans that reference a table (e.g. NamedTable), perhaps there will/should be an operator that actually just merges metadata (perhaps metadata, managed by different systems, will be maintained in separate catalogs/locations and want to be merged at some point in query execution).

drin · 2023-03-06T23:33:31Z

I won't commit yet to proposing a particular standard, but maybe I can help guide the discussion to consider a couple ways sorted information is represented and which aspects we would like to preserve in a standard.

alamb · 2023-03-07T16:42:25Z

this is super helpful. This is the relevant PR for datafusion: apache/datafusion#1776. @alamb , if you have extra input it'd be nice to hear.

Currently, the node expects you to declare which columns are sorted ahead of time and, if they aren't, if will give you bad data.

Yes, I think this is the standard situation (I have debugged many bugs in various past lives related to sortedness)

DataFusion has gotten quite a bit more sophisticated in its sortedness handling / removing Sorts if not required based on metadata such as https://github.com/apache/arrow-datafusion/blob/928662bb12d915aef83abba1312392d25770f68f/datafusion/core/src/physical_optimizer/sort_enforcement.rs#L18 and https://github.com/apache/arrow-datafusion/blob/928662bb12d915aef83abba1312392d25770f68f/datafusion/core/src/physical_optimizer/global_sort_selection.rs

In terms of metadata, I recommend adding something to the Arrow standard as sorting is so important (and doesn't really vary from system to system)

Things that should be covered:

where do nulls sort (first or last)
ASC / DESC
Any collation considerations (ideally we would keep it as simple as possible)

drin · 2023-03-09T21:09:29Z

wanted to mention that #32884 likely has some relevance

coady added the Type: enhancement label Mar 4, 2023

github-actions bot added Component: C++ Component: Python labels Mar 4, 2023

rtpsw mentioned this issue Mar 6, 2023

GH-32884: [C++] Add ordered aggregation #34311

Merged

gitmodimo mentioned this issue Oct 9, 2024

GH-41706: [C++][Acero] Enhance asof_join to work in multi-threaded execution by sequencing input #44083

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[C++][Python] A metadata standard for sorted datasets. #34451

[C++][Python] A metadata standard for sorted datasets. #34451

coady commented Mar 4, 2023

westonpace commented Mar 6, 2023

drin commented Mar 6, 2023 •

edited

Loading

westonpace commented Mar 6, 2023

drin commented Mar 6, 2023

drin commented Mar 6, 2023

alamb commented Mar 7, 2023

drin commented Mar 9, 2023

[C++][Python] A metadata standard for sorted datasets. #34451

[C++][Python] A metadata standard for sorted datasets. #34451

Comments

coady commented Mar 4, 2023

Describe the enhancement requested

Component(s)

westonpace commented Mar 6, 2023

drin commented Mar 6, 2023 • edited Loading

westonpace commented Mar 6, 2023

drin commented Mar 6, 2023

drin commented Mar 6, 2023

alamb commented Mar 7, 2023

drin commented Mar 9, 2023

drin commented Mar 6, 2023 •

edited

Loading