set_index calls quietly sort the Ensemble dataframes #242

dougbrn · 2023-09-28T21:29:23Z

I recently discovered that the default behavior of dd.set_index is to sort the dataframe by the index values. This adds costly overhead to the workflow, and it is needless when the user's data is already sorted. In principle, TAPE should be able to function without a sorted index. We should consider how to best implement this sorting functionality as an optional feature, and give users the ability to not do it for datasets that don't require it.

Additional note: if we are sorting the dataframes, it may be worthwhile to investigate generating division information.


hombit · 2023-09-29T12:20:27Z

> We should consider how to best implement this sorting functionality as an optional feature, and give users the ability to not do it for datasets that don't require it.

A third option, which could be the default behavior, is to check whether the index is sorted while reading the data. This is significantly cheaper than force-sorting, and it would inform the user if the data must be sorted.

dougbrn · 2023-09-29T22:33:36Z

Yeah, it would be nice if we could scan and check for that. I'm not sure how costly it would be, since Dask will have to actually verify that the data is sorted, and that operation can't be done lazily.
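A rough sketch of what such a check involves, in plain Python with partitions modeled as lists of index values. The function names are hypothetical and this is not TAPE's or Dask's implementation. Each partition must be ordered internally, and its first value must not fall below the last value of the previous partition. That is a single linear pass with no shuffle, but every value has to be read, which is why it cannot stay lazy.

```python
from typing import Iterable, Optional, Sequence


def partition_is_sorted(values: Sequence) -> bool:
    """Return True if the values within one partition are non-decreasing."""
    return all(a <= b for a, b in zip(values, values[1:]))


def index_is_sorted(partitions: Iterable[Sequence]) -> bool:
    """Return True if the index is non-decreasing within and across partitions."""
    previous_last: Optional[object] = None
    for part in partitions:
        if len(part) == 0:
            continue  # empty partitions cannot break the ordering
        if not partition_is_sorted(part):
            return False
        # The boundary between consecutive partitions must also be ordered.
        if previous_last is not None and part[0] < previous_last:
            return False
        previous_last = part[-1]
    return True


print(index_is_sorted([[1, 2], [2, 5], [7]]))
print(index_is_sorted([[1, 2], [0, 5]]))
```

The first call has ordered partitions and ordered boundaries; the second fails at the boundary between partitions even though each partition is sorted on its own.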

dougbrn · 2023-10-25T16:31:06Z

Auto-sorting was removed as the default behavior in #276, and users now opt in to it via the sort and sorted flags in the data loader functions, à la Dask.

There is the remaining question of whether to do a sort check on data load, per @hombit. I agree that it isn't too costly, but I am not sure about the sequencing. If the check runs on load, it is triggered by the same call in which the user already has the opportunity to specify whether the data is sorted. This means they might need to immediately reload the ensemble data with a different set of kwargs. Maybe this is fine? Having the ability to sort (#247) might also resolve this, as a user could load data, see that it's not sorted, and then use the sort call if wanted.

dougbrn · 2023-11-29T21:09:00Z

Going to close this as the original scope of the issue has been addressed.

dougbrn added the bug label on Sep 28, 2023

dougbrn mentioned this issue Oct 23, 2023

add check functions #276

Merged

dougbrn closed this as completed Nov 29, 2023
