Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Support passing compression args to gzip and bz2 #33398

Merged
merged 8 commits into from
Apr 10, 2020

Conversation

jessefarnham
Copy link
Contributor

@jessefarnham jessefarnham commented Apr 8, 2020

This commit closes #33196 but takes a more generic approach than the suggested
solution. Instead of providing a 'fast' kwarg or global compression level
setting, this commit extends the ability to pass compression settings as a
dict to the gzip and bz2 compression methods. In this way, if the user
wants faster compression, they can pass
compression={'method': 'gzip', 'compresslevel'=1} rather than
just compression='gzip'.

Note: For the API to be consistent when passing paths vs. filelikes, GZipFile and gzip2.open() must accept the same kwargs.

@jessefarnham
Copy link
Contributor Author

jessefarnham commented Apr 8, 2020

One question to consider: Instead of providing the ability to provide compression level args on a case-by case basis depending on protocol (originally just zip, now gzip and bz2 as well), would it be better to simply provide this ability to all zip protocols? I don't see anything in the code that prevents this--we would simply add **compression_args to each protocol's open() method or file constructor.

@jessefarnham jessefarnham force-pushed the enh-33196 branch 2 times, most recently from d3fe3fb to a7ee476 Compare April 8, 2020 15:42
Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable

)
@pytest.mark.parametrize("method", ["to_pickle", "to_json", "to_csv"])
def test_gzip_compression_level_path(obj, method):
"""GH#33398 Ideally this test should be repeated for bz2 as well,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there another test we can do to ensure we pass the keyword args through? If even just that the call works I think OK

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, we can have a test that simply asserts that the call succeeds without asserting anything about size. I'll add.

@WillAyd WillAyd added the IO Data IO issues that don't fit into a more specific label label Apr 8, 2020
@pep8speaks
Copy link

pep8speaks commented Apr 8, 2020

Hello @jessefarnham! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-04-10 20:39:31 UTC

@@ -397,16 +403,18 @@ def get_handle(
# GZ Compression
if compression == "gzip":
if is_path:
f = gzip.open(path_or_buf, mode)
f = gzip.open(path_or_buf, mode, **compression_args) # type: ignore
Copy link
Contributor Author

@jessefarnham jessefarnham Apr 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that I added these and the below mypy ignore flags. The problem is that we're passing kwargs to functions that don't take kwargs, which is not typesafe. Both of the alternatives I could think of seem worse:

  • Changing the called function signatures to take an unused **kwargs parameter.
  • Explicitly extracting each parameter from compression_args and passing it individually, thereby needing to duplicate the default parameter values.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good! just a comment about docs

pandas/core/generic.py Show resolved Hide resolved
@jreback jreback added this to the 1.1 milestone Apr 8, 2020
@jreback jreback requested a review from gfyoung April 8, 2020 20:43
Copy link
Member

@gfyoung gfyoung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other comments notwithstanding, LGTM. Nice work!

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm @jreback

@WillAyd
Copy link
Member

WillAyd commented Apr 10, 2020

@jessefarnham looks like there is a merge conflict - can you fix that up?

@jessefarnham
Copy link
Contributor Author

@jessefarnham looks like there is a merge conflict - can you fix that up?

Fixed

@jessefarnham
Copy link
Contributor Author

@WillAyd Conflict resolved, but check seems to be failing because linting on master is currently broken

@WillAyd
Copy link
Member

WillAyd commented Apr 10, 2020

Gotcha; should be fixed in #33461 so can re-merge after that gets into master

This commit closes GH#33196 but takes a more generic approach than the suggested
solution. Instead of providing a 'fast' kwarg or global compression level
setting, this commit extends the ability to pass compression settings as a
dict to the gzip and bz2 compression methods. In this way, if the user
wants faster compression, they can pass
compression={'method': 'gzip', 'compresslevel'=1} rather than
just compression='gzip'.

Note: For the API to be consistent when passing  paths vs. filelikes, GZipFile
and gzip2.open() must accept the same kwargs.
@jessefarnham
Copy link
Contributor Author

Gotcha; should be fixed in #33461 so can re-merge after that gets into master

All set, ready for merge to master.

@jreback jreback merged commit 7456fc5 into pandas-dev:master Apr 10, 2020
@jreback
Copy link
Contributor

jreback commented Apr 10, 2020

thanks @jessefarnham very nice!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Data IO issues that don't fit into a more specific label
Projects
None yet
Development

Successfully merging this pull request may close these issues.

to_csv / to_pickle optionally use fast gzip compressionlevel=1
5 participants