Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use highest available pickle protocol when serializing #737

Merged
merged 3 commits into from
Mar 24, 2022

Conversation

rbetz
Copy link
Contributor

@rbetz rbetz commented Feb 11, 2022

Pickle protocol 5 allows for zero-copy pickling of numpy arrays, but protocol version 4 remains the default for Python 3.8 onwards. When loading datasets with large numpy arrays and reader_pool_type="process", using protocol 5 significantly improves serialization time and overall performance.

Protocol 5 is only available in Python 3.8 and onwards, so for backwards and forwards compatibility with future pickle protocol improvements we specify pickle.HIGHEST_PROTOCOL.

Small test case

import time
import numpy as np
from petastorm.reader_impl.pickle_serializer import PickleSerializer

s = PickleSerializer()
data = [np.random.random((1000, 1000)) for _ in range(10)]
t = time.time()
for x in range(100):
  _ = s.deserialize(s.serialize(data))
print(time.time() - t)

master:
11.40 sec

This PR:
8.78 sec

@CLAassistant
Copy link

CLAassistant commented Feb 11, 2022

CLA assistant check
All committers have signed the CLA.

@selitvin
Copy link
Collaborator

The change looks good to me. Can you please update docs/release-notes.rst and accept the CLA? Once we land it, I can cut a new release.

@codecov
Copy link

codecov bot commented Feb 11, 2022

Codecov Report

Merging #737 (5a73d49) into master (76a36fe) will not change coverage.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master     #737   +/-   ##
=======================================
  Coverage   86.27%   86.27%           
=======================================
  Files          85       85           
  Lines        5084     5084           
  Branches      787      787           
=======================================
  Hits         4386     4386           
  Misses        559      559           
  Partials      139      139           
Impacted Files Coverage Δ
petastorm/reader_impl/pickle_serializer.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 76a36fe...5a73d49. Read the comment docs.

@selitvin selitvin self-requested a review February 14, 2022 18:37
@rbetz
Copy link
Contributor Author

rbetz commented Mar 14, 2022

bump... Is there something I need to do to get the unit tests to run?

@selitvin
Copy link
Collaborator

bump... Is there something I need to do to get the unit tests to run?

Sorry. Did not notice you were blocked on the tests. I had to approve the run for a user's first time. Should be ok going forward. Also, note there is a flaky test (some test that has to do with reading from a partitioned dataset). Feel free to rerunning the build if you bump into it.

@rbetz
Copy link
Contributor Author

rbetz commented Mar 22, 2022

@selitvin I think I need permission to run the tests again. They did all pass, except for the ones you mentioned were flaky.

@selitvin
Copy link
Collaborator

That's weird. I thought it I am expected to approve running tests only for the first time. Will see if I can reconfigure this.

@rbetz
Copy link
Contributor Author

rbetz commented Mar 23, 2022

Thanks for your help. Looks like the tests passed this time, so it's ready to merge :)

@selitvin
Copy link
Collaborator

Beautiful! Thank you for the PR.

@selitvin selitvin merged commit 26e03c7 into uber:master Mar 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants