[C++][Python] A metadata standard for sorted datasets. #34451
Comments
We're pretty close in Acero to the point where we can use this for real advantage internally. For example, there is a PR (#34311) up for a more RAM-efficient aggregation if we know the data is segmented / sorted. Currently, the node expects you to declare which columns are sorted ahead of time and, if they aren't, it will give you bad data. However, if we had a metadata standard in place, then pyarrow/R (for in-memory tables) and datasets (for on-disk tables) could automatically detect this condition and apply the more efficient aggregation. There are probably a few unknowns about how exactly that should happen (an optimization pass based on data statistics, e.g. orderedness? Detection on the fly while running a plan?), but getting a standard in for representing this information would be a good first step.
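To illustrate why declared sortedness matters, here is a minimal pure-Python sketch of a segmented aggregation. It is not Acero's implementation; `segmented_sum` and the sample rows are hypothetical names chosen for illustration. It only shows the general idea that a streaming aggregation over sorted keys holds one running group at a time, and that the same code silently produces split groups when the input is not actually sorted.

```python
from itertools import groupby
from operator import itemgetter


def segmented_sum(rows):
    """Sum values per key, assuming rows arrive sorted (or at least grouped) by key.

    Only one group is held in memory at a time, so memory use does not grow
    with the number of distinct keys. Hypothetical helper, for illustration only.
    """
    for key, group in groupby(rows, key=itemgetter(0)):
        yield key, sum(value for _, value in group)


sorted_rows = [("a", 1), ("a", 2), ("b", 3), ("c", 4), ("c", 5)]
unsorted_rows = [("a", 1), ("b", 3), ("a", 2)]

# Correct: each key appears exactly once in the output.
sorted_result = list(segmented_sum(sorted_rows))

# "Bad data": key "a" is emitted twice because its rows were not contiguous.
unsorted_result = list(segmented_sum(unsorted_rows))
```

With trustworthy sortedness metadata, a planner could pick this streaming strategy automatically and fall back to a hash-based aggregation otherwise.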
In some work I'm doing I plan on using metadata keys that are system-specific. It seems that part of the requirement here would be a way of "namespacing" metadata attributes for cooperative system design. I would also think that, at minimum, the following namespaces would always potentially co-exist:
With the following possible caveats:
Not sure if there has been any other proposal for metadata management in Arrow that should be leveraged.
Sortedness metadata has been discussed on the ML before. The reception seemed generally favorable, though no proposal was ever put forward. The same goes for min/max metadata. In many systems, I believe sortedness is often recorded outside of the files themselves as part of some catalog or table format (e.g. Iceberg). As for namespacing, there is some precedent in the spec:
I don't think there is a need at the moment for an Acero namespace, as the only things (so far) that Acero would be interested in are sortedness, min/max statistics, and/or index information. All of this should be universally applicable and ideally agreed upon across all implementations, not just in Acero (though we could start there while doing initial work, with the hope of making a proposal for the wider community).
This is super helpful. This is the relevant PR for DataFusion: apache/datafusion#1776. @alamb, if you have extra input it'd be nice to hear.
I think this addresses the aspect I was most concerned with, though namespace reservation hadn't totally occurred to me (awkward if multiple projects have conflicting namespaces). In the short term this can probably be assumed not to be a problem (so Acero can claim an arbitrary namespace when it's reasonable to do so).
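As a rough sketch of what prefix-based namespacing of metadata keys could look like, the snippet below groups key/value metadata by the text before the first `:`. The `ARROW:extension:name` key is the existing reserved-prefix convention in the Arrow format; the `acero:` and `myapp:` keys, the values, and both helper functions are hypothetical and not part of any agreed standard.

```python
def split_namespace(key, sep=":"):
    """Split 'namespace:rest' into (namespace, rest).

    Keys without a separator are placed in the empty namespace ''.
    """
    namespace, found, rest = key.partition(sep)
    return (namespace, rest) if found else ("", key)


def by_namespace(metadata):
    """Group a flat key/value metadata mapping by namespace prefix."""
    grouped = {}
    for key, value in metadata.items():
        namespace, name = split_namespace(key)
        grouped.setdefault(namespace, {})[name] = value
    return grouped


metadata = {
    "ARROW:extension:name": "example.uuid",  # existing Arrow convention; value is made up
    "acero:sort_order": "time",              # hypothetical key
    "myapp:owner": "team-x",                 # hypothetical key
    "comment": "no namespace",
}

grouped = by_namespace(metadata)
```

Two cooperating systems can then read only their own prefix and ignore the rest, which is the property the namespacing discussion above is after. It does not by itself prevent two projects from picking the same prefix.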
The interesting thing here seems to be that for substrait plans that reference a table (e.g. |
I won't commit yet to proposing a particular standard, but maybe I can help guide the discussion to consider a couple of ways sorted information is represented and which aspects we would like to preserve in a standard.
Yes, I think this is the standard situation (I have debugged many bugs in various past lives related to sortedness). DataFusion has gotten quite a bit more sophisticated in its sortedness handling and in removing Sorts that are not required, based on metadata; see https://github.com/apache/arrow-datafusion/blob/928662bb12d915aef83abba1312392d25770f68f/datafusion/core/src/physical_optimizer/sort_enforcement.rs#L18 and https://github.com/apache/arrow-datafusion/blob/928662bb12d915aef83abba1312392d25770f68f/datafusion/core/src/physical_optimizer/global_sort_selection.rs. In terms of metadata, I recommend adding something to the Arrow standard, as sorting is so important (and doesn't really vary from system to system). Things that should be covered:
Wanted to mention that #32884 likely has some relevance.
Describe the enhancement requested
Split off from #34153.
In order to take advantage of sorted columns, it would be necessary for Arrow to standardize on a way to represent sorting in dataset/table metadata, akin to pandas' `index_columns`.
Component(s)
C++, Python
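For discussion purposes, here is a minimal sketch of how a sort order could be carried in schema-level key/value metadata, loosely modelled on how the pandas metadata is stored as JSON under a single key. Everything here is an assumption: the key name `example:sort_order`, the JSON layout (`column` / `order`), and the helper functions are hypothetical and do not reflect any agreed Arrow standard. Plain dicts stand in for schema metadata and rows so the sketch runs without pyarrow.

```python
import json

# Hypothetical metadata key; no such key is defined by the Arrow spec.
SORT_KEY = b"example:sort_order"


def encode_sort_order(sort_keys):
    """Serialize a list of {'column', 'order'} dicts into bytes-keyed metadata."""
    return {SORT_KEY: json.dumps(sort_keys).encode("utf-8")}


def decode_sort_order(metadata):
    """Return the declared sort keys, or None if the table makes no claim."""
    raw = metadata.get(SORT_KEY)
    return json.loads(raw.decode("utf-8")) if raw is not None else None


def is_sorted_by(rows, sort_keys):
    """Check that consecutive rows honour the declared multi-column ordering."""
    for prev, cur in zip(rows, rows[1:]):
        for spec in sort_keys:
            a, b = prev[spec["column"]], cur[spec["column"]]
            if a == b:
                continue  # tie on this column: the next sort key decides
            in_order = a < b if spec["order"] == "ascending" else a > b
            if not in_order:
                return False
            break  # this column decided the pair; later keys are irrelevant
    return True


declared = [
    {"column": "day", "order": "ascending"},
    {"column": "price", "order": "descending"},
]
metadata = encode_sort_order(declared)

rows = [
    {"day": 1, "price": 9},
    {"day": 1, "price": 5},
    {"day": 2, "price": 7},
]

claim = decode_sort_order(metadata)
claim_holds = is_sorted_by(rows, claim)
```

A real proposal would also need to settle details this sketch leaves out, such as null placement and whether the claim applies per file, per fragment, or to the whole dataset.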