-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Python] Dataset PartitioningFactory cannot be serialized in Python #34884
Comments
All the partitioning objects we have today can boil down to a schema (which can be saved as an empty parquet/arrow file) and a string denoting the type (e.g. "dictionary" or "hive" or "filename"). I'm not sure if this is helpful or not since I suspect the goal is automatic serialization. |
Thanks @westonpace ! The current workaround in Dask is indeed to allow the user to specify a dictionary like |
…classes (#36462) ### Rationale for this change Add support for pickling Directory/Hive/FilenamePartitioning objects. Does not yet actually fix the issue #34884, because this PR only addresses the actual Partitioning subclasses, and not the PartitioningFactory subclasses. ### Are these changes tested? Yes ### Are there any user-facing changes? Only new support for pickling and `==` operation. * Issue: #34884 Lead-authored-by: Joris Van den Bossche <[email protected]> Co-authored-by: Weston Pace <[email protected]> Signed-off-by: Joris Van den Bossche <[email protected]>
…ngFactory objects
…ory objects (#36550) ### Rationale for this change #36462 already added support for pickling Partitioning objects, but not yet the PartitioningFactory objects. The problem for PartitioningFactory is that we currently don't really expose the full class hierarchy in python, just the base class PartitioningFactory. We also don't expose creating those factory objects, except through the `discover` methods of the Partitioning classes. I think it would be nice to keep this minimal binding, but that means if we want to make them serializable with pickle, we need another way to do that (and if we don't want to add custom code for serialization on the C++ side). In this PR, I went for the route of essentially storing the constructor (the discover static method) and the arguments that were passed to the constructor, on the factory object, so we can use this info for pickling. Not the nicest code, but the simplest solution I could think of. ### Are these changes tested? Yes * Closes: #34884 Authored-by: Joris Van den Bossche <[email protected]> Signed-off-by: Joris Van den Bossche <[email protected]>
Thanks @jorisvandenbossche ! |
Describe the bug, including details regarding any error messages, version, and platform.
I would like to be able to serialize a dictionary of
pyarrow.dataset.dataset
key-word arguments (for parallel processing in Dask). However, it is not possible to do this when one of those arguments contains aPartitioning
/PartitioningFactory
object, because those objects cannot be serialized in Python.Reproducer:
Component(s)
Python
The text was updated successfully, but these errors were encountered: