Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++][Python] A metadata standard for sorted datasets. #34451

Open
coady opened this issue Mar 4, 2023 · 7 comments
Open

[C++][Python] A metadata standard for sorted datasets. #34451

coady opened this issue Mar 4, 2023 · 7 comments

Comments

@coady
Copy link

coady commented Mar 4, 2023

Describe the enhancement requested

Split off from #34153.

In order to take advantage of sorted columns, it would be necessary for arrow to standardize on a way to represent sorting in datasets/table metadata. Akin to to pandas' index_columns.

Component(s)

C++, Python

@westonpace
Copy link
Member

We're pretty close in Acero to the point where we can use this for real advantage internally. For example, there is a PR (#34311) up for a more RAM-efficient aggregation if we know the data is segmented / sorted. Currently, the node expects you to declare which columns are sorted ahead of time and, if they aren't, if will give you bad data.

However, if we had a metadata standard in place then pyarrow/R (for in-memory tables) and datasets (for on-disk tables) could automatically detect this condition and apply the more efficient aggregation. There are probably a few unknowns about how exactly that should happen (An optimization pass based on data statistics (e.g. orderedness)? Detected on the fly while running a plan?) but getting a standard in for representing this information would be a good first step.

@drin
Copy link
Contributor

drin commented Mar 6, 2023

In some work I'm doing I plan on using metadata keys that are system-specific. It seems that part of the requirement here would be a way of "namespacing" metadata attributes for cooperative system design. I would also think that, at minimum, the following namespaces would always potentially co-exist:

  1. acero (or arrow, generally)
  2. substrait
  3. "the system" (top-level system managing arrow data)
  4. some application (whatever is interacting with the data stored in an arrow object)

With the following possible caveats:

  • (3) and (4) could be the same (maybe some large application such as duckdb that doesn't share the data with any other system).
  • (2) and (4) could be the same (maybe an application funnels all intent through a substrait producer).
  • If acero is never actively used, then conflicts with its namespace are moot (though prone to refactoring if acero is every adopted in the future)

Not sure if there has been any other proposal of metadata management in arrow that should be leveraged.

@westonpace
Copy link
Member

Not sure if there has been any other proposal of metadata management in arrow that should be leveraged.

Sortedness metadata has been discussed on the ML before. The reception seemed generally favorable though no proposal was ever put forward. Similarly with min/max metadata.

In many systems I believe sortedness is often recorded outside of the files themselves as part of some catalog or table format (e.g. Iceberg)

As for namespacing, there is some precedent in the spec:

The colon symbol : is to be used as a namespace separator. It can be used multiple times in a key.

The ARROW pattern is a reserved namespace for internal Arrow use in the custom_metadata fields. For example, ARROW:extension:name.

I don't think there is a need at the moment for an Acero namespace as the only thing (so far) that Acero would be interested in would be sortedness and min/max statistics and/or index information. All of this should be universally applicable and ideally agreed upon across all implementations and not just in Acero (though we could start there while doing initial work with the hope of making a proposal for the wider community).

@drin
Copy link
Contributor

drin commented Mar 6, 2023

Sortedness metadata has been discussed on the ML before

this is super helpful. This is the relevant PR for datafusion: apache/datafusion#1776. @alamb , if you have extra input it'd be nice to hear.

As for namespacing, there is some precedent in the spec

I think this addresses the aspect I was most concerned with, though namespace reservation hadn't totally occurred to me (awkward if multiple projects have conflicting namespaces). Probably in the short term this can be assumed to not be problematic (so acero can claim an arbitrary namespace when it's reasonable to do so).

In many systems I believe sortedness is often recorded outside of the files themselves as part of some catalog or table format

The interesting thing here seems to be that for substrait plans that reference a table (e.g. NamedTable), perhaps there will/should be an operator that actually just merges metadata (perhaps metadata, managed by different systems, will be maintained in separate catalogs/locations and want to be merged at some point in query execution).

@drin
Copy link
Contributor

drin commented Mar 6, 2023

I won't commit yet to proposing a particular standard, but maybe I can help guide the discussion to consider a couple ways sorted information is represented and which aspects we would like to preserve in a standard.

@alamb
Copy link
Contributor

alamb commented Mar 7, 2023

this is super helpful. This is the relevant PR for datafusion: apache/datafusion#1776. @alamb , if you have extra input it'd be nice to hear.

Currently, the node expects you to declare which columns are sorted ahead of time and, if they aren't, if will give you bad data.

Yes, I think this is the standard situation (I have debugged many bugs in various past lives related to sortedness)

DataFusion has gotten quite a bit more sophisticated in its sortedness handling / removing Sorts if not required based on metadata such as https://github.com/apache/arrow-datafusion/blob/928662bb12d915aef83abba1312392d25770f68f/datafusion/core/src/physical_optimizer/sort_enforcement.rs#L18 and https://github.com/apache/arrow-datafusion/blob/928662bb12d915aef83abba1312392d25770f68f/datafusion/core/src/physical_optimizer/global_sort_selection.rs

In terms of metadata, I recommend adding something to the Arrow standard as sorting is so important (and doesn't really vary from system to system)

Things that should be covered:

  1. where do nulls sort (first or last)
  2. ASC / DESC
  3. Any collation considerations (ideally we would keep it as simple as possible)

@drin
Copy link
Contributor

drin commented Mar 9, 2023

wanted to mention that #32884 likely has some relevance

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants