Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update cloudpickle to 0.6.1 #3076

Closed
wants to merge 5 commits into from
Closed

Conversation

suquark
Copy link
Member

@suquark suquark commented Oct 17, 2018

What do these changes do?

This PR is used to fix problems found in updating cloudpickle.

Related issue number

#2685
#2928

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8705/
Test FAILed.

@suquark
Copy link
Member Author

suquark commented Oct 18, 2018

It's so weird. recursion_test always fails. I tried in different systems and cannot reproduce such errors. I am not sure what has happened.

@suquark
Copy link
Member Author

suquark commented Oct 18, 2018

@robertnishihara We haven't enabled debug mode for our travis CI, right?
https://docs.travis-ci.com/user/running-build-in-debug-mode/
Could you send a email to them to enable debug mode? It could be much easier to solve some problems with debug mode (I often did such things in CircleCI).

@robertnishihara
Copy link
Collaborator

@suquark, I think that debug mode is enabled. However, I've always found it difficult to use. Have you tried both linux and mac?

@suquark
Copy link
Member Author

suquark commented Oct 18, 2018

Yes. I have tried. It will succeed when running the recursive test alone.

@suquark
Copy link
Member Author

suquark commented Oct 18, 2018

Currently, I think a workaround is just keep the changes we need in cloudpickle since the original version have no problems. I will remove some suspicious code.

@robertnishihara
Copy link
Collaborator

Did you run it with python -m pytest -v test/recursion_test.py?

The error is very interesting, I wonder if it's related to 9a52ba31f270fb6888d31c012e14a841f8b82acb or 8eaf637e78733fe5b4c295d9204dc6dcc76fb342.

It's also possible that we need to undo this code

# Work around limitations of Python pickling.
function = remote_function._function
function_name_global_valid = function.__name__ in function.__globals__
function_name_global_value = function.__globals__.get(
function.__name__)
# Allow the function to reference itself as a global variable
if not is_cython(function):
function.__globals__[function.__name__] = remote_function
try:
pickled_function = pickle.dumps(function)
finally:
# Undo our changes
if function_name_global_valid:
function.__globals__[function.__name__] = (
function_name_global_value)
else:
del function.__globals__[function.__name__]

@suquark
Copy link
Member Author

suquark commented Oct 18, 2018

Yes. I run exactly the command. I will do a binary search to identify the bug.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8742/
Test PASSed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8744/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8745/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8751/
Test FAILed.

@suquark
Copy link
Member Author

suquark commented Oct 19, 2018

jenkins, retest this please

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8758/
Test PASSed.

@suquark suquark changed the title WIP: Update cloudpickle to 0.6.1 Update cloudpickle to 0.6.1 Oct 19, 2018
@suquark
Copy link
Member Author

suquark commented Oct 19, 2018

It is repaired by removing global-related code in cloudpickle. Comments are left in code to mark those removals.

@robertnishihara
Copy link
Collaborator

Thanks @suquark, is it a bug in cloudpickle? If so, can we submit a PR to fix cloudpickle?

@suquark
Copy link
Member Author

suquark commented Oct 19, 2018

@robertnishihara I don't think it is a bug. Pickling python objects itself is not well-defined because what are needed to be serialized and what should be kept as references really depend on the application. So we just keep what we need from their code.
And we could ask cloudpickle to control the 'level' of serialization just like compilers to avoid some over-serialization. But it will certainly take more time. I suggest merge this PR first.

@mitar
Copy link
Member

mitar commented Oct 22, 2018

It would be great to get this fixed soon. We have problems because of this cloud pickle issues.

I see that there are further changes made by @suquark here, to fix problems in cloud pickle which are detected by Ray CI. If I understand correctly, they are not really bugs but it is that our needs from cloud pickle is different than what default logic they have provides. So or we maintain patched version or we get them to add a switch so that we can disable things which are breaking our use case.

@robertnishihara
Copy link
Collaborator

@mitar, I agree it would be great to get these issues resolved soon. Maintain our own version of cloudpickle will be a nightmare, so we should stick with the official code base.

@suquark can you explain what situations cause cloudpickle to "overserialize" things? What can we do to avoid triggering those cases?

@suquark
Copy link
Member Author

suquark commented Oct 23, 2018

@robertnishihara It's really hard to give a simple example. But here are some points:

  1. When restoring the global variables, in Ray we should always give serialized global variables higher priority (that is, always overriding existing global variables). The new changes in cloudpickle gives higher priority to existing global variables.
  2. In Ray, do not try to infer global variables from the __module__ attribute of a function. The new changes in cloudpickle try to make use of __module__ when it fails to fetch global variables.

If you insist on keeping sync with the official cloudpickle, I could make a PR to introduce some switches controlling the behavior. However, this will introduce new issues. For example, should this switch belong to the module or belong to the CloudPickler class? Should it be controlled by the serializer or the deserializer?

@mitar
Copy link
Member

mitar commented Oct 23, 2018

Are those things also something any user of Ray would have to think about?

Personally, I really do not see this pickling of much benefit to us. This is useful in scripting mode, but for production, it would just be simple to send over a full Python path of a function to call and expect it to be there on a worker available, through a regular Python importing.

@robertnishihara
Copy link
Collaborator

@mitar, separately I think we will provide a mode doing something like what you outlined above, where modules are available on each machine and don't need to be serialized. However, even in that case it will be hard to avoid pickle entirely, e.g., for shipping around custom serializers and things like that.

@mitar
Copy link
Member

mitar commented Oct 23, 2018

Even for custom serializers I would expect the code to be already there.

@mitar
Copy link
Member

mitar commented Oct 29, 2018

So, how can we make progress here? This is pretty important for us.

@raulchen
Copy link
Contributor

raulchen commented Nov 6, 2018

does this PR still have problems?

@robertnishihara
Copy link
Collaborator

@raulchen this problem with this PR is that it's not actually using the upstream cloudpickle. Instead @suquark applied some patches. There is some discussion on cloudpipe/cloudpickle#214 about the best way to patch the upstream cloudpickle.

@nmatthews-asapp
Copy link
Contributor

Looks like cloudpipe/cloudpickle#240 may help here. I assume it will be in 0.7.1 of cloudpickle

@pcmoritz
Copy link
Contributor

pcmoritz commented Feb 13, 2019 via email

@pcmoritz pcmoritz closed this Feb 13, 2019
@suquark suquark deleted the cloudpickle branch October 3, 2019 23:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants