Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

change meta estimation #409

Merged
merged 2 commits into from
Sep 13, 2023
Merged

change meta estimation #409

merged 2 commits into from
Sep 13, 2023

Conversation

PhilippeMoussalli
Copy link
Contributor

@PhilippeMoussalli PhilippeMoussalli commented Sep 5, 2023

PR that fixes an issue with the load from hub component when setting an index dynamically. Previous implementation relied on estimating the meta/schema dynamically but this lead to some errors since sometimes the wrong type was inferred.

This PR fixes this issue by reading in the schema from the component spec

+ (dataframe.id.cumsum()).astype(str)
str(partition_info["number"])
+ "_"
+ (dataframe.id.cumsum()).astype(str)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we set the id as str here. Was there no problem with inferring divisions with this ?

Copy link
Contributor Author

@PhilippeMoussalli PhilippeMoussalli Sep 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's mainly an issue if you cast an index that is already an integer to a string (or vice-versa) because you then change the order/sorting from numerical to lexicographical.

image

This shouldn't be the case here since we're creating a string index from the start. Dask will infer the divisions based on the lexicographical order of the index from the start (and then it should be preserved).

#391 (comment)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense, thx!

@GeorgesLorre GeorgesLorre self-requested a review September 13, 2023 09:49
@PhilippeMoussalli PhilippeMoussalli merged commit a079473 into main Sep 13, 2023
@PhilippeMoussalli PhilippeMoussalli deleted the fix-load-from-hub branch September 13, 2023 09:52
Hakimovich99 pushed a commit that referenced this pull request Oct 16, 2023
PR that fixes an issue with the load from hub component when setting an
index dynamically. Previous implementation relied on estimating the
`meta/schema` dynamically but this lead to some errors since sometimes
the wrong type was inferred.

This PR fixes this issue by reading in the schema from the component
spec
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants