remote storage auto-compression #1239

Closed
BartekRoszak opened this issue Oct 18, 2018 · 14 comments
Labels: enhancement, feature request, p2-medium

@BartekRoszak commented Oct 18, 2018

Have you considered adding an auto-compression feature for data sent to remote storage?
It could save a lot of storage space and speed up sending files to the remote.
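
For illustration, a minimal sketch of what that could look like on the upload path, assuming gzip as the codec; upload_blob is a hypothetical callable standing in for whatever the remote backend provides:

```python
import gzip
import shutil
from pathlib import Path

def push_compressed(local_path: Path, remote_key: str, upload_blob) -> None:
    """Gzip a cache file, hand the result to the remote backend, clean up."""
    tmp = Path(str(local_path) + ".gz")
    with open(local_path, "rb") as src, gzip.open(tmp, "wb") as dst:
        shutil.copyfileobj(src, dst)          # stream, so large files fit in memory
    try:
        upload_blob(tmp, remote_key + ".gz")  # hypothetical backend call
    finally:
        tmp.unlink()                          # don't leave the temporary archive behind
```

Download would do the reverse: fetch the blob and gunzip it back into the cache.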

@efiop (Contributor) commented Oct 18, 2018

Hi @BartekRoszak!

Great idea! We are thinking about ways to optimize cache size (e.g. #829), but just haven't gotten to working on it yet. Let's leave this issue open as a reminder.

@efiop added the enhancement label Oct 18, 2018
@efiop added this to the Queue milestone Oct 18, 2018
@efiop added the feature request and p4-not-important labels Jul 23, 2019
@jorgeorpinel (Contributor) commented Sep 25, 2019

Ivan and I were just talking about this while reviewing https://dvc.org/doc/get-started/example-versioning, which uses dvc get to download a couple of ZIP files containing images: data.zip with 1000 images and new-labels.zip with another 1000 images of the same kind (so they're actually parts of the same dataset).

For a new document we're writing, we will reuse this image dataset, but as an uncompressed directory with two versions: the first with the contents of data.zip only (1000 images), and the second with data.zip + new-labels.zip (2000 images).

Besides the fact that compression probably doesn't help much with these images anyway (though it would with tabular data, e.g. CSV or JSON), this enables DVC's file deduplication between directory versions, so that when you download the new revision you only get the delta (sketched below).
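
To make that deduplication concrete, a hedged sketch of the delta logic: DVC content-addresses files by md5, so moving a second version of the directory only needs to transfer files whose hashes the other side hasn't seen (shown here for the push direction; download is symmetric). remote_has is a hypothetical membership check standing in for the real remote index:

```python
import hashlib
from pathlib import Path

def delta_to_push(directory: Path, remote_has) -> list[Path]:
    """List only the files whose content hash is missing on the remote."""
    missing = []
    for path in sorted(directory.rglob("*")):
        if path.is_file():
            # reads the whole file for brevity; real code would hash in chunks
            digest = hashlib.md5(path.read_bytes()).hexdigest()
            if not remote_has(digest):  # hypothetical: is this hash on the remote?
                missing.append(path)
    return missing
```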

Anyway, this is all to say that automatic compression would be a very natural feature to have (I would call this p2-important), and then we should also discourage the tracking of compressed archives all over our docs (except for backups perhaps).

Thoughts, @iterative/engineering ?

@efiop added the p3-nice-to-have label and removed the p4-not-important label Sep 25, 2019
@ghost pinned this issue Sep 26, 2019
@ghost unpinned this issue Sep 26, 2019
@pared (Contributor) commented Oct 9, 2019

Sorry @jorgeorpinel, I didn't notice this before.
What do you mean by

> this enables DVC's file deduplication between directory versions, so that when you download the new revision you only get the delta

Isn't that already happening for directories?

@jorgeorpinel (Contributor) commented Oct 9, 2019

Yes, that's my point: deduplication already happens, but not if you track/version compressed datasets. (My comment above has a long preamble, sorry.) The main idea/suggestion is in the last paragraph:

> ...this is all to say that automatic compression would be a very natural feature to have (I would call this p2-important), and then we should also discourage the tracking of compressed archives all over our docs.

@efiop added the p2-medium label and removed the p3-nice-to-have label Oct 10, 2019
@Seanny123 commented May 19, 2020

I would like to try to rephrase this issue to make sure I understand it.

Given that large directories can take a long time to upload, should they be automatically compressed without the user's intervention? So the process would be something like:

  1. User does dvc add my_big_folder.

  2. Some DVC settings determine under what circumstances a folder should be compressed and how (see the sketch after this list).

  3. User does dvc push my_big_folder, which either uploads the individual files of my_big_folder or my_big_folder.zip, depending on the settings in step 2.

  4. If the user modifies my_big_folder, the md5 hash will be different, so before updating, will the whole folder need to be compressed again? Is there some way to do incremental compression with some sort of processing/efficiency trade-off?
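
A sketch of what the settings in step 2 might boil down to; the extension list and size threshold below are invented for illustration and are not actual DVC config options:

```python
from pathlib import Path

# Hypothetical defaults; real settings would live in DVC's config.
COMPRESSIBLE = {".csv", ".tsv", ".json", ".txt"}
MIN_SIZE = 1024  # bytes; tiny files gain little and add per-file overhead

def should_compress(path: Path) -> bool:
    """Decide per file whether compression is likely to pay off."""
    return path.suffix.lower() in COMPRESSIBLE and path.stat().st_size >= MIN_SIZE
```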

@Seanny123 commented May 19, 2020

> Is there some way to do incremental compression with some sort of processing/efficiency trade-off?

It turns out that modifying a zip file without re-compressing it is documented in the man page of zip.
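
The same capability is available from Python's standard zipfile module: opening an archive in append mode adds new members without recompressing the entries already stored in it.

```python
import zipfile

def append_member(archive: str, new_file: str) -> None:
    """Add new_file to an existing archive; prior members are left untouched."""
    with zipfile.ZipFile(archive, mode="a", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.write(new_file)
```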

@jorgeorpinel (Contributor)

> 3. User does dvc push my_big_folder, which either uploads the individual files of my_big_folder or my_big_folder.zip, depending on the settings in step 2.

DVC actually uploads files individually, so it would probably not compress entire directories, but your two previous steps seem right.

> 4. If the user modifies my_big_folder, the md5 hash will be different, so before updating, will the whole folder need to be compressed again?

Again, since files are checked and down/uploaded individually, this shouldn't be such a big deal. But thanks for the reference on updating zipped content.

I'm not sure what compression algorithm would be best, but TBH I have the impression ZIP won't be the top pick.
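
For what it's worth, a quick comparison is easy to run with the codecs the Python standard library already ships; they trade speed for ratio quite differently, which is exactly why the pick matters:

```python
import bz2
import gzip
import lzma
from pathlib import Path

def compare_codecs(path: str) -> None:
    """Print each codec's output size as a fraction of the original file."""
    data = Path(path).read_bytes()
    for name, compress in (("gzip", gzip.compress),
                           ("bz2", bz2.compress),
                           ("lzma", lzma.compress)):
        print(f"{name}: {len(compress(data)) / len(data):.1%} of original")
```

Third-party codecs like zstd (not in the stdlib) are often cited as a better speed/ratio trade-off for this kind of transparent compression.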

@kenahoo commented Jan 5, 2021

Hi all - was this ever implemented in DVC? If not, what's the best practice for checking in large-ish CSV files that are dependencies for pipelines?

@efiop (Contributor) commented Jan 5, 2021

@kenahoo It wasn't, but we are considering it in #829, which we are actively researching right now.

> If not, what's the best practice for checking in large-ish CSV files that are dependencies for pipelines?

Not sure I understand your question, could you elaborate, please?

@kenahoo commented Jan 5, 2021

> Not sure I understand your question, could you elaborate, please?

I have some large CSV files that are input data to some processes defined in dvc.yaml. They take a while to transfer back and forth between environments because they're stored raw, not compressed. I could store them in DVC as .csv.gz files instead and have the processes take care of uncompressing as needed, but that pushes the complexity into those processes. I'm wondering what other people do.
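
One detail that keeps that workaround cheaper than it sounds: many readers infer the codec from the file extension, so a stage can often consume the compressed file directly. For example, with pandas (the path below is just a placeholder):

```python
import pandas as pd

# pandas infers compression from the extension, so no explicit unzip step is needed
df = pd.read_csv("data/large_input.csv.gz")  # hypothetical path
```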

@shcheklein (Member)

@kenahoo what remote storage do you use? I wonder if it's possible to enable compression/decompression at the storage level for some remotes.

@jorgeorpinel (Contributor)

I think this could be a great optional/plugin (dvcx?) feature, even for custom/shared external cache dirs. Users with inefficient data formats could enable some config option, e.g. cache.compression gz, and let DVC do the rest. Maybe it could even apply to the local cache as well, if the effective cache.type is copy.

Having to uncompress cached files to compare content hash values (e.g. to check for changes during dvc status) does sound non-ideal, though I suppose compression formats already contain such/similar metadata (gzip and ZIP both store a CRC-32 of the uncompressed data, for example).
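
A sketch of one way around that concern, under the assumption that the digest is computed from the uncompressed content at add time: name the compressed blob after that digest, so status checks compare digests without ever decompressing. All names here are illustrative:

```python
import gzip
import hashlib
from pathlib import Path

def cache_compressed(src: Path, cache_dir: Path) -> str:
    """Store a gzipped blob named after the md5 of the *uncompressed* content."""
    data = src.read_bytes()
    digest = hashlib.md5(data).hexdigest()
    blob = cache_dir / f"{digest}.gz"
    if not blob.exists():                  # content already cached: nothing to do
        blob.write_bytes(gzip.compress(data))
    return digest                          # status compares digests, never unpacks
```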

@papiot commented Feb 8, 2021

> I have some large CSV files that are input data to some processes defined in dvc.yaml. They take a while to transfer back and forth... I could store them in DVC as .csv.gz files instead and have the processes take care of uncompressing as needed, but that pushes the complexity into those processes.

Currently running into exactly this issue. Will keep an eye on this feature.

@efiop (Contributor) commented Oct 8, 2021

Closing in favor of #829, as compression will be part of that mechanism (kind of like Git pack files).

@efiop closed this as completed Oct 8, 2021