
Add more compression types for to_json #3551

Merged: 12 commits merged into huggingface:master from compression_json on Feb 21, 2022

Conversation

bhavitvyamalik (Contributor):

This PR adds bz2, xz, and zip (WIP) compression support for `to_json`. I also plan to add an `infer` option, like pandas has.

bhavitvyamalik (Contributor Author) commented Jan 13, 2022:

@lhoestq, I looked into how to compress with `zipfile`; a few approaches exist, let me know which one looks good:

  1. Create the file in normal `wb` mode and then zip it separately.
  2. Use `ZipFile.writestr` to write the file into the archive. For this we'll need to change how we're writing files from the `_write` method.

Pandas handles this by creating a wrapper around the standard library class `ZipFile` whose returned file-like handle accepts byte strings via a `write` method instead of `writestr` (the purpose of the wrapper was essentially to rename that function).
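For reference, a minimal sketch of that wrapper idea (the class name and the buffer-then-flush-on-close strategy are illustrative assumptions, not the actual pandas or PR code):

    import io
    import zipfile

    class BytesZipFile(zipfile.ZipFile):
        """Expose a bytes-accepting `write` method so a ZipFile can be
        used like an ordinary binary file handle."""

        def __init__(self, file, mode="w", archive_name=None, **kwargs):
            kwargs.setdefault("compression", zipfile.ZIP_DEFLATED)
            super().__init__(file, mode, **kwargs)
            self.archive_name = archive_name
            self._buffer = io.BytesIO()

        def write(self, data: bytes) -> int:
            # Buffer the bytes; the single archive entry is written on close.
            return self._buffer.write(data)

        def close(self):
            if self._buffer.tell():
                self.writestr(self.archive_name or "data", self._buffer.getvalue())
                self._buffer = io.BytesIO()
            super().close()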

lhoestq (Member) commented Jan 14, 2022:

Option 1 sounds not ideal since it creates an intermediary file. I like pandas' approach. Is it possible to implement option 2 using the pandas class? Or maybe we can have something similar?

@bhavitvyamalik marked this pull request as ready for review January 26, 2022 13:52
bhavitvyamalik (Contributor Author):

Definitely, @lhoestq! I've adapted that from the original pandas code, and it turns out to be faster than gzip compression. Apart from that, I've also added an `infer` option to automatically infer the compression type from the given `path_or_buf` (a rough sketch of the idea is below).
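For illustration, a sketch of extension-based inference (the mapping and helper name are assumptions mirroring pandas' `compression="infer"` behavior, not the exact code in this PR):

    import os

    # Assumed mapping from file extension to compression protocol.
    _EXTENSION_TO_COMPRESSION = {".gz": "gzip", ".bz2": "bz2", ".xz": "xz", ".zip": "zip"}

    def infer_compression(path_or_buf):
        if isinstance(path_or_buf, (str, os.PathLike)):
            ext = os.path.splitext(str(path_or_buf))[1].lower()
            return _EXTENSION_TO_COMPRESSION.get(ext)
        return None  # a buffer has no extension to infer from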

bhavitvyamalik (Contributor Author):

One small thing: currently I'm assuming that the user will provide a compression extension in `path_or_buf`. Should this also be possible: `dataset.to_json("from_dataset.json", compression="zip")`? Should I add an assert to ensure the file name provided always has a compression extension?

lhoestq (Member) commented Jan 31, 2022:

Thanks!

> One small thing: currently I'm assuming that the user will provide a compression extension in `path_or_buf`. Should this also be possible: `dataset.to_json("from_dataset.json", compression="zip")`? Should I add an assert to ensure the file name provided always has a compression extension?

I think it's fine as it is right now :) No need to check the extension of the filename passed to `path_or_buf`.

lhoestq (Member) commented Jan 31, 2022:

> it turns out to be faster than gzip compression

I think the default compression level of gzip is 9 in Python, which is very slow. Maybe we can switch to compression level 6 instead, which is faster, like the `gzip` command on unix.
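For example, a minimal sketch of the change being suggested (the file name and payload are made up):

    import gzip

    # Python's gzip.open defaults to compresslevel=9 (best ratio, slowest);
    # the unix gzip command defaults to level 6.
    with gzip.open("dataset.json.gz", "wt", compresslevel=6) as f:
        f.write('{"text": "hello"}\n')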

lhoestq (Member) commented Jan 31, 2022:

I found that fsspec has something that may interest you: `fsspec.open(..., compression=...)`. I don't remember if we've already mentioned it or not.

It also has zip if I understand correctly! See https://github.com/fsspec/filesystem_spec/blob/master/fsspec/compression.py#L70

Since fsspec is a dependency of datasets, we can use all of this :)

Let me know if you prefer using fsspec instead (I haven't tested writing compressed files with it yet). IMO it sounds pretty easy to use, and it would make the code base simpler.
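Roughly like this sketch (the file name is illustrative; `compression="infer"` picks the codec from the extension):

    import fsspec

    # Write a gzip-compressed JSON lines file through fsspec.
    with fsspec.open("dataset.json.gz", "wt", compression="gzip") as f:
        f.write('{"text": "hello"}\n')

    # Read it back, letting fsspec infer the codec from the ".gz" suffix.
    with fsspec.open("dataset.json.gz", "rt", compression="infer") as f:
        print(f.read())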

bhavitvyamalik (Contributor Author):

Just tried fsspec, but I'm not able to write compressed zip files :/ gzip, xz, and bz2 are all working fine, and it's really simple (no need for `FileWriteHandler` now!)

@@ -255,3 +257,15 @@ def test_dataset_to_json_orient_invalidproc(self, dataset):
        with pytest.raises(ValueError):
            with io.BytesIO() as buffer:
                JsonDatasetWriter(dataset, buffer, num_proc=0)

    @pytest.mark.parametrize("compression, extension", [("gzip", "gz"), ("bz2", "bz2"), ("xz", "xz")])
bhavitvyamalik (Contributor Author) commented on the diff above:

Somehow the gzip test is failing due to a few mismatches.

bhavitvyamalik (Contributor Author):

Update: instead of reading the compressed files and comparing them directly, I decompressed them using fsspec and then compared the contents. The bug went away!
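A sketch of that comparison (file names are illustrative). The raw-byte mismatches are unsurprising for gzip in particular, since the gzip header embeds a modification timestamp, so two compressions of identical data need not be byte-identical:

    import fsspec

    def read_decompressed(path):
        # Compare decompressed payloads, not raw compressed bytes.
        with fsspec.open(path, "rb", compression="infer") as f:
            return f.read()

    assert read_decompressed("expected.json.gz") == read_decompressed("actual.json.gz")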

lhoestq (Member) left a comment:

Thanks! Just adding an error message in case someone passes `compression` for a buffer or file-like object:

src/datasets/io/json.py
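A hypothetical sketch of the check being requested (the helper name and message are illustrative, not the code that was merged):

    import os

    # Reject `compression` when the destination is a buffer or
    # file-like object rather than a path.
    def _check_compression(path_or_buf, compression):
        if compression is not None and not isinstance(path_or_buf, (str, os.PathLike)):
            raise NotImplementedError(
                f"The compression parameter is not supported when writing to a "
                f"buffer, but compression={compression} was passed. "
                f"Please provide a local path instead."
            )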
@lhoestq merged commit 290456d into huggingface:master on Feb 21, 2022
@bhavitvyamalik deleted the compression_json branch July 10, 2022 14:36