SchemaError breaks pickle #713

Closed
3 tasks done
matthiashuschle opened this issue Dec 18, 2021 · 2 comments
Labels
bug Something isn't working

Comments

@matthiashuschle
Contributor

Description
SchemaError (and probably SchemaErrors) have a couple of problems that make it impossible to use them with pickle:

  1. Pickling breaks when the schema attribute contains Check objects with lambdas or local functions from Check.isin or similar.
  2. Unpickling always breaks, because the constructor signature differs from that of Exception and has more than one required positional argument.
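Both failure modes can be reproduced without pandera. The following is a minimal sketch, assuming a hypothetical `DemoSchemaError` class whose constructor merely resembles the signature described above; it is not pandera's actual class. By default, an exception is rebuilt on unpickling by calling its class with `self.args`, which here contains only the message.

```python
import pickle


class DemoSchemaError(Exception):
    """Hypothetical stand-in with several required positional arguments."""

    def __init__(self, schema, data, message):
        super().__init__(message)  # only `message` ends up in self.args
        self.schema = schema
        self.data = data


# Problem 2: dumps() succeeds, but loads() calls DemoSchemaError("bad dtype"),
# which is missing the `data` and `message` arguments.
payload = pickle.dumps(DemoSchemaError("schema", [-1, 0, 1], "bad dtype"))
try:
    pickle.loads(payload)
    load_error = None
except TypeError as exc:
    load_error = exc

# Problem 1: an attribute holding a lambda cannot be pickled at all.
try:
    pickle.dumps(DemoSchemaError(lambda x: x > 0, [-1, 0, 1], "check failed"))
    dump_error = None
except (pickle.PicklingError, AttributeError) as exc:
    dump_error = exc

print(type(load_error).__name__, load_error)
print(type(dump_error).__name__, dump_error)
```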

This matters because when a subprocess raises an uncaught exception (which may well be the intention of using pandera), the exception object becomes part of the return value, which is piped to the parent process using pickle. This use case also raises a third issue: the size limit of these pipes is 2 GiB per pickled object, and the data contained in the exception can easily exceed that.

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of pandera.
  • (optional) I have confirmed this bug exists on the master branch of pandera.

Code Sample

import pickle
import pandas as pd
from pandera import DataFrameSchema, Check, Column
from pandera.errors import SchemaError
data = pd.DataFrame({"a": [-1, 0, 1]})

# case 1 with Check.isin:
schema = DataFrameSchema({
    "a": Column(int, Check.isin([0, 1]))
})
try:
    schema.validate(data)
except SchemaError as exc:
    # raises AttributeError
    pickle.loads(pickle.dumps(exc))

# case 1 with lambda:
schema = DataFrameSchema({
    "a": Column(int, Check(lambda x: x > 0))
})
try:
    schema.validate(data)
except SchemaError as exc:
    # raises PicklingError
    pickle.loads(pickle.dumps(exc))

# case 2:
schema = DataFrameSchema({
    "a": Column(str)
})
try:
    schema.validate(data)
except SchemaError as exc:
    # raises TypeError during unpickling
    pickle.loads(pickle.dumps(exc))

Expected behavior

None of those should raise an exception. Then the exception would be handed to the parent process in a multiprocessing setting. There is no way to keep the actual data and schema attributes in this case, so they should be replaced.

Desktop (please complete the following information):

  • OS: Ubuntu 20.04
  • Python: 3.7, 3.8

Proposal

  • The unpickling issue can be solved by implementing __reduce__.
  • The problem with unpicklable content and possibly huge attributes cannot be solved while preserving them. My proposal would be to implement __getstate__ to map all attributes of __dict__ to their string representation.
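A minimal sketch of how the two proposed pieces could fit together is shown below. It is an illustration under stated assumptions, not the implementation that was eventually merged: `PicklableDemoError` and the helper `_rebuild` are hypothetical names, and the class only mimics a three-argument exception constructor. `__reduce__` is overridden explicitly so that the stringified state from `__getstate__` is what gets pickled, and the object is recreated without calling `__init__`.

```python
import pickle


def _rebuild(cls, args):
    """Hypothetical helper: recreate the exception without calling __init__."""
    exc = cls.__new__(cls)
    exc.args = args
    return exc


class PicklableDemoError(Exception):
    """Hypothetical stand-in, not pandera's SchemaError."""

    def __init__(self, schema, data, message):
        super().__init__(message)
        self.schema = schema
        self.data = data

    def __getstate__(self):
        # Replace every attribute with its string representation, so that
        # unpicklable or very large objects never enter the pickle stream.
        return {key: str(value) for key, value in self.__dict__.items()}

    def __reduce__(self):
        # (callable, arguments for the callable, state applied afterwards)
        return (_rebuild, (type(self), self.args), self.__getstate__())


err = PicklableDemoError(lambda x: x > 0, [-1, 0, 1], "check failed")
restored = pickle.loads(pickle.dumps(err))
print(type(restored).__name__, restored, restored.data)
```

The trade-off is the one named in the proposal: after the round trip, `schema` and `data` are strings, so the parent process receives a description of the failure rather than the original objects.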

If there is consensus on the desired behavior, I can implement it.

@matthiashuschle matthiashuschle added the bug Something isn't working label Dec 18, 2021
@cosmicBboy
Collaborator

Thanks for raising this issue @matthiashuschle!

Your proposal looks good, and thanks in advance for the contribution!

Let me know if you have any questions about dev environment setup.

matthiashuschle added a commit to matthiashuschle/pandera that referenced this issue Dec 30, 2021
cosmicBboy pushed a commit that referenced this issue Dec 31, 2021
* make SchemaError and SchemaErrors picklable

* make ParserError picklable

* refactor #713
@cosmicBboy
Collaborator

fixed by #722
