-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
set_index calls quietly sort the Ensemble dataframes #242
Comments
The third option, which could be a default behavior, is checking the index to be sorted while reading the data. This is significantly cheaper than force-sorting, and would inform user if data must be sorted. |
Yeah it would be nice if we could scan and check for that. I'm not sure how costly it will be since Dask will have to actually verify that it is sorted, and that is an operation that can't be done lazily. |
Auto-sorting is removed as the default behavior in #276, and users now opt in to it via the There is the remaining question of whether to do a sort check on data load, per @hombit. I agree that it isn't too costly, but am not sure about the sequencing. If it checks to see if the data is sorted on load, the check will be triggered by the call where the user already has the opportunity to specify whether the data is sorted or not. This means they might need to immediately reload the ensemble data with a different kwarg set. Maybe this is fine? Having the ability to sort (#247 ) might also resolve this, as a user may load data, see that it's not sorted, and then use the sort call if it's wanted. |
Going to close this as the original scope of the issue has been addressed. |
Discovered this recently that the default behavior of
dd.set_index
is to sort the dataframe based on the index values. This introduces a costly overhead to the workflow, and is needless when the user already has their data sorted. In principle, TAPE should be able to function without a sorted index. We should consider how to best implement this sorting functionality as an optional feature, and give users the ability to not do it for datasets that don't require it.Additional note, if we are sorting the dataframes, it may be worthwhile to investigate generation of division information.
The text was updated successfully, but these errors were encountered: