Description
SchemaError (and probably SchemaErrors) has a couple of problems that make it impossible to use with pickle:

1. Pickling breaks when the schema attribute contains Check objects with lambdas or local functions, e.g. from Check.isin.
2. Unpickling always breaks, because the constructor signature differs from Exception and has more than one required positional argument.

This is relevant because when a subprocess raises an uncaught exception - which might be exactly the reason for using pandera - the exception object is part of the return value, which is piped to the parent process using pickle. This use case also raises a third issue: the size limit of these pipes is 2 GiB per pickled object, and the data contained in the exception can easily grow larger than that.
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandera.
(optional) I have confirmed this bug exists on the master branch of pandera.
Code Sample
import pickle

import pandas as pd

from pandera import DataFrameSchema, Check, Column
from pandera.errors import SchemaError

data = pd.DataFrame({"a": [-1, 0, 1]})

# case 1 with Check.isin:
schema = DataFrameSchema({
    "a": Column(int, Check.isin([0, 1]))
})
try:
    schema.validate(data)
except SchemaError as exc:
    # raises AttributeError
    pickle.loads(pickle.dumps(exc))

# case 1 with lambda:
schema = DataFrameSchema({
    "a": Column(int, Check(lambda x: x > 0))
})
try:
    schema.validate(data)
except SchemaError as exc:
    # raises PicklingError
    pickle.loads(pickle.dumps(exc))

# case 2:
schema = DataFrameSchema({
    "a": Column(str)
})
try:
    schema.validate(data)
except SchemaError as exc:
    # raises TypeError during unpickling
    pickle.loads(pickle.dumps(exc))
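To illustrate the multiprocessing scenario from the description, here is a hedged sketch (the worker function and pool setup are invented for illustration and are not part of the original report): a SchemaError raised in a pool worker has to be pickled by the worker and unpickled by the parent, and that round trip is where the problems above surface.

import multiprocessing

import pandas as pd
from pandera import DataFrameSchema, Column

schema = DataFrameSchema({"a": Column(str)})

def validate_chunk(df: pd.DataFrame) -> pd.DataFrame:
    # Raises SchemaError for the integer data below; the exception becomes
    # part of the worker's result and is piped back to the parent via pickle.
    return schema.validate(df)

if __name__ == "__main__":
    with multiprocessing.Pool(2) as pool:
        # The SchemaError should be re-raised here in the parent, but the
        # pickle round trip breaks for the reasons shown in the cases above.
        pool.map(validate_chunk, [pd.DataFrame({"a": [-1, 0, 1]})])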
Expected behavior
None of these cases should raise an exception during pickling or unpickling; the SchemaError could then be handed to the parent process in a multiprocessing setting. There is no way to keep the actual data and schema attributes in that case, so they should be replaced.
Desktop (please complete the following information):
OS: Ubuntu 20.04
Python: 3.7, 3.8
Proposal
The unpickling issue can be solved by implementing __reduce__.
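A minimal sketch of that idea, using a stand-in class rather than pandera's actual SchemaError (the constructor signature here is an assumption; the real one takes more arguments):

import pickle

# Stand-in for SchemaError: like the real class, its constructor needs more
# than one positional argument, so Exception's default pickle protocol fails.
class SchemaError(Exception):
    def __init__(self, schema, data, message):
        super().__init__(message)
        self.schema = schema
        self.data = data

    def __reduce__(self):
        # Tell pickle to rebuild the exception by calling the class with all
        # of its positional arguments instead of Exception's single-arg protocol.
        return self.__class__, (self.schema, self.data, str(self))

exc = SchemaError("schema placeholder", "data placeholder", "column 'a' failed")
assert pickle.loads(pickle.dumps(exc)).schema == "schema placeholder"

This fixes reconstruction (case 2 above), but it still pickles the schema and data attributes themselves, which is what the string mapping below addresses.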
The problem of unpicklable content and possibly huge attributes cannot be solved while preserving them. My proposal would be to implement __getstate__ to map all attributes in __dict__ to their string representation.
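A hedged sketch of how the two pieces could fit together, again on the stand-in class: the stringified __dict__ is returned as the third element of the __reduce__ tuple so pickle applies it onto the rebuilt object.

import pickle

import pandas as pd

class SchemaError(Exception):
    def __init__(self, schema, data, message):
        super().__init__(message)
        self.schema = schema
        self.data = data

    def __getstate__(self):
        # Map every attribute to its string representation so lambdas and
        # large data frames never enter the pickle stream.
        return {key: str(value) for key, value in self.__dict__.items()}

    def __reduce__(self):
        # Rebuild with placeholder arguments; the stringified state is then
        # applied on top of the new object via __dict__.update().
        return self.__class__, (None, None, str(self)), self.__getstate__()

exc = SchemaError(object(), pd.DataFrame({"a": [-1, 0, 1]}), "column 'a' failed")
restored = pickle.loads(pickle.dumps(exc))
assert isinstance(restored.data, str)  # data survives only as its string form

Pickling only string representations would also keep the payload far below the 2 GiB pipe limit mentioned in the description.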
If there is consensus on the desired behavior, I can implement it.