
Add more compression types for to_json #3551

Merged: 12 commits merged into huggingface:master from compression_json on Feb 21, 2022

Conversation

bhavitvyamalik (Contributor):

This PR adds bz2, xz, and zip (WIP) compression support for `to_json`. I also plan to add an `infer` option, like pandas has.

bhavitvyamalik (Contributor Author) commented Jan 13, 2022:

@lhoestq, I looked into how to compress with `zipfile`; a few approaches exist, let me know which one looks good:

  1. Create the file in normal `wb` mode and then zip it separately.
  2. Use `ZipFile.writestr` to write the file into the archive. For this we'll need to change how we're writing files from the `_write` method.

Pandas handles this by creating a wrapper around the standard library class `ZipFile` whose returned file-like handle accepts byte strings via a `write` method instead of `writestr` (the purpose of the wrapper was essentially to rename that function).
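For reference, a minimal sketch of that wrapper idea (the class name and the buffer-then-flush-on-close strategy are illustrative assumptions, not the actual pandas or PR code):

    import io
    import zipfile

    class BytesZipFile(zipfile.ZipFile):
        """Expose a bytes-accepting `write` method so a ZipFile can be
        used like an ordinary binary file handle."""

        def __init__(self, file, mode="w", archive_name=None, **kwargs):
            kwargs.setdefault("compression", zipfile.ZIP_DEFLATED)
            super().__init__(file, mode, **kwargs)
            self.archive_name = archive_name
            self._buffer = io.BytesIO()

        def write(self, data: bytes) -> int:
            # Buffer the bytes; the single archive entry is written on close.
            return self._buffer.write(data)

        def close(self):
            if self._buffer.tell():
                self.writestr(self.archive_name or "data", self._buffer.getvalue())
                self._buffer = io.BytesIO()
            super().close()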

lhoestq (Member) commented Jan 14, 2022:

Option 1 sounds not ideal since it creates an intermediary file. I like pandas' approach. Is it possible to implement option 2 using the pandas class? Or maybe we can have something similar?

@bhavitvyamalik marked this pull request as ready for review January 26, 2022 13:52
bhavitvyamalik (Contributor Author):

Definitely, @lhoestq! I've adapted that from the original pandas code, and it turns out to be faster than gzip compression. Apart from that, I've also added an `infer` option to automatically infer the compression type from the given `path_or_buf` (a rough sketch of the idea is below).
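For illustration, a sketch of extension-based inference (the mapping and helper name are assumptions mirroring pandas' `compression="infer"` behavior, not the exact code in this PR):

    import os

    # Assumed mapping from file extension to compression protocol.
    _EXTENSION_TO_COMPRESSION = {".gz": "gzip", ".bz2": "bz2", ".xz": "xz", ".zip": "zip"}

    def infer_compression(path_or_buf):
        if isinstance(path_or_buf, (str, os.PathLike)):
            ext = os.path.splitext(str(path_or_buf))[1].lower()
            return _EXTENSION_TO_COMPRESSION.get(ext)
        return None  # a buffer has no extension to infer from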

bhavitvyamalik (Contributor Author):

One small thing: currently I'm assuming that the user will provide a compression extension in `path_or_buf`. Should this also be possible: `dataset.to_json("from_dataset.json", compression="zip")`? Should I add an assert to ensure the file name provided always has a compression extension?

lhoestq (Member) commented Jan 31, 2022:

Thanks!

> One small thing: currently I'm assuming that the user will provide a compression extension in `path_or_buf`. Should this also be possible: `dataset.to_json("from_dataset.json", compression="zip")`? Should I add an assert to ensure the file name provided always has a compression extension?

I think it's fine as it is right now :) No need to check the extension of the filename passed to `path_or_buf`.

lhoestq (Member) commented Jan 31, 2022:

> it turns out to be faster than gzip compression

I think the default compression level of gzip is 9 in Python, which is very slow. Maybe we can switch to compression level 6 instead, which is faster, like the `gzip` command on unix.
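For example, a minimal sketch of the change being suggested (the file name and payload are made up):

    import gzip

    # Python's gzip.open defaults to compresslevel=9 (best ratio, slowest);
    # the unix gzip command defaults to level 6.
    with gzip.open("dataset.json.gz", "wt", compresslevel=6) as f:
        f.write('{"text": "hello"}\n')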

lhoestq (Member) commented Jan 31, 2022:

I found that fsspec has something that may interest you: `fsspec.open(..., compression=...)`. I don't remember if we've already mentioned it or not.

It also has zip if I understand correctly! See https://github.com/fsspec/filesystem_spec/blob/master/fsspec/compression.py#L70

Since fsspec is a dependency of datasets, we can use all of this :)

Let me know if you prefer using fsspec instead (I haven't tested writing compressed files with it yet). IMO it sounds pretty easy to use, and it would make the code base simpler.
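Roughly like this sketch (the file name is illustrative; `compression="infer"` picks the codec from the extension):

    import fsspec

    # Write a gzip-compressed JSON lines file through fsspec.
    with fsspec.open("dataset.json.gz", "wt", compression="gzip") as f:
        f.write('{"text": "hello"}\n')

    # Read it back, letting fsspec infer the codec from the ".gz" suffix.
    with fsspec.open("dataset.json.gz", "rt", compression="infer") as f:
        print(f.read())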

bhavitvyamalik (Contributor Author):

Just tried fsspec, but I'm not able to write compressed zip files :/ gzip, xz, and bz2 are all working fine, and it's really simple (no need for `FileWriteHandler` now!)

@@ -255,3 +257,15 @@ def test_dataset_to_json_orient_invalidproc(self, dataset):
        with pytest.raises(ValueError):
            with io.BytesIO() as buffer:
                JsonDatasetWriter(dataset, buffer, num_proc=0)

    @pytest.mark.parametrize("compression, extension", [("gzip", "gz"), ("bz2", "bz2"), ("xz", "xz")])
bhavitvyamalik (Contributor Author) commented on the diff above:

Somehow the gzip test is failing due to a few mismatches.

bhavitvyamalik (Contributor Author):

Update: instead of reading the compressed files and comparing them directly, I decompressed them using fsspec and then compared the contents. The bug went away!
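A sketch of that comparison (file names are illustrative). The raw-byte mismatches are unsurprising for gzip in particular, since the gzip header embeds a modification timestamp, so two compressions of identical data need not be byte-identical:

    import fsspec

    def read_decompressed(path):
        # Compare decompressed payloads, not raw compressed bytes.
        with fsspec.open(path, "rb", compression="infer") as f:
            return f.read()

    assert read_decompressed("expected.json.gz") == read_decompressed("actual.json.gz")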

lhoestq (Member) left a comment:

Thanks! Just adding an error message in case someone passes `compression` for a buffer or file-like object:

src/datasets/io/json.py
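A hypothetical sketch of the check being requested (the helper name and message are illustrative, not the code that was merged):

    import os

    # Reject `compression` when the destination is a buffer or
    # file-like object rather than a path.
    def _check_compression(path_or_buf, compression):
        if compression is not None and not isinstance(path_or_buf, (str, os.PathLike)):
            raise NotImplementedError(
                f"The compression parameter is not supported when writing to a "
                f"buffer, but compression={compression} was passed. "
                f"Please provide a local path instead."
            )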
@lhoestq merged commit 290456d into huggingface:master on Feb 21, 2022
@bhavitvyamalik deleted the compression_json branch July 10, 2022 14:36