remote storage auto-compression #1239

Closed
BartekRoszak opened this issue Oct 18, 2018 · 14 comments
Labels: enhancement, feature request, p2-medium

@BartekRoszak commented Oct 18, 2018

Have you considered adding an auto-compression feature for data sent to remote storage?
It could save a lot of storage space and speed up sending files to the remote.
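
For illustration, a minimal sketch of what that could look like on the upload path, assuming gzip as the codec; upload_blob is a hypothetical callable standing in for whatever the remote backend provides:

```python
import gzip
import shutil
from pathlib import Path

def push_compressed(local_path: Path, remote_key: str, upload_blob) -> None:
    """Gzip a cache file, hand the result to the remote backend, clean up."""
    tmp = Path(str(local_path) + ".gz")
    with open(local_path, "rb") as src, gzip.open(tmp, "wb") as dst:
        shutil.copyfileobj(src, dst)          # stream, so large files fit in memory
    try:
        upload_blob(tmp, remote_key + ".gz")  # hypothetical backend call
    finally:
        tmp.unlink()                          # don't leave the temporary archive behind
```

Download would do the reverse: fetch the blob and gunzip it back into the cache.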

@efiop (Contributor) commented Oct 18, 2018

Hi @BartekRoszak!

Great idea! We are thinking about ways to optimize cache size (e.g. #829), but just haven't gotten to working on it yet. Let's leave this issue open as a reminder.

@efiop added the enhancement label Oct 18, 2018
@efiop added this to the Queue milestone Oct 18, 2018
@efiop added the feature request and p4-not-important labels Jul 23, 2019
@jorgeorpinel (Contributor) commented Sep 25, 2019

Ivan and I were just talking about this while reviewing https://dvc.org/doc/get-started/example-versioning, which uses dvc get to download a couple of ZIP files containing images: data.zip with 1000 images and new-labels.zip with another 1000 images of the same kind (so they're actually parts of the same dataset).

For a new document we're writing, we will reuse this image dataset, but as an uncompressed directory with two versions: the first with the contents of data.zip only (1000 images), and the second with data.zip + new-labels.zip (2000 images).

Besides the fact that compression probably doesn't help much with these images anyway (though it would with tabular data, e.g. CSV or JSON), this enables DVC's file deduplication between directory versions, so that when you download the new revision you only get the delta (sketched below).
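
To make that deduplication concrete, a hedged sketch of the delta logic: DVC content-addresses files by md5, so moving a second version of the directory only needs to transfer files whose hashes the other side hasn't seen (shown here for the push direction; download is symmetric). remote_has is a hypothetical membership check standing in for the real remote index:

```python
import hashlib
from pathlib import Path

def delta_to_push(directory: Path, remote_has) -> list[Path]:
    """List only the files whose content hash is missing on the remote."""
    missing = []
    for path in sorted(directory.rglob("*")):
        if path.is_file():
            # reads the whole file for brevity; real code would hash in chunks
            digest = hashlib.md5(path.read_bytes()).hexdigest()
            if not remote_has(digest):  # hypothetical: is this hash on the remote?
                missing.append(path)
    return missing
```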

Anyway, this is all to say that automatic compression would be a very natural feature to have (I would call this p2-important), and then we should also discourage the tracking of compressed archives all over our docs (except for backups perhaps).

Thoughts, @iterative/engineering ?

@efiop added the p3-nice-to-have label and removed the p4-not-important label Sep 25, 2019
@ghost pinned this issue Sep 26, 2019
@ghost unpinned this issue Sep 26, 2019
@pared (Contributor) commented Oct 9, 2019

Sorry @jorgeorpinel, I didn't notice this before.
What do you mean by

> this enables DVC's file deduplication between directory versions, so that when you download the new revision you only get the delta

Isn't that already happening for directories?

@jorgeorpinel (Contributor) commented Oct 9, 2019

Yes, that's my point: deduplication already happens, but not if you track/version compressed datasets. (My comment above has a long preamble, sorry.) The main idea/suggestion is in the last paragraph:

> ...this is all to say that automatic compression would be a very natural feature to have (I would call this p2-important), and then we should also discourage the tracking of compressed archives all over our docs.

@efiop added the p2-medium label and removed the p3-nice-to-have label Oct 10, 2019
@Seanny123 commented May 19, 2020

I would like to try to rephrase this issue to make sure I understand it.

Given that large directories can take a long time to upload, should they be automatically compressed without the user's intervention? So the process would be something like:

  1. User does dvc add my_big_folder.

  2. Some DVC settings determine under what circumstances a folder should be compressed and how (see the sketch after this list).

  3. User does dvc push my_big_folder, which either uploads the individual files of my_big_folder or my_big_folder.zip, depending on the settings in step 2.

  4. If the user modifies my_big_folder, the md5 hash will be different, so before updating, will the whole folder need to be compressed again? Is there some way to do incremental compression with some sort of processing/efficiency trade-off?
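
A sketch of what the settings in step 2 might boil down to; the extension list and size threshold below are invented for illustration and are not actual DVC config options:

```python
from pathlib import Path

# Hypothetical defaults; real settings would live in DVC's config.
COMPRESSIBLE = {".csv", ".tsv", ".json", ".txt"}
MIN_SIZE = 1024  # bytes; tiny files gain little and add per-file overhead

def should_compress(path: Path) -> bool:
    """Decide per file whether compression is likely to pay off."""
    return path.suffix.lower() in COMPRESSIBLE and path.stat().st_size >= MIN_SIZE
```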

@Seanny123 commented May 19, 2020

> Is there some way to do incremental compression with some sort of processing/efficiency trade-off?

It turns out that modifying a zip file without re-compressing it is documented in the man page of zip.
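
The same capability is available from Python's standard zipfile module: opening an archive in append mode adds new members without recompressing the entries already stored in it.

```python
import zipfile

def append_member(archive: str, new_file: str) -> None:
    """Add new_file to an existing archive; prior members are left untouched."""
    with zipfile.ZipFile(archive, mode="a", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.write(new_file)
```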

@jorgeorpinel (Contributor)

> 3. User does dvc push my_big_folder, which either uploads the individual files of my_big_folder or my_big_folder.zip, depending on the settings in step 2.

DVC actually uploads files individually, so it would probably not compress entire directories, but your two previous steps seem right.

> 4. If the user modifies my_big_folder, the md5 hash will be different, so before updating, will the whole folder need to be compressed again?

Again, since files are checked and down/uploaded individually, this shouldn't be such a big deal. But thanks for the reference on updating zipped content.

I'm not sure what compression algorithm would be best, but TBH I have the impression ZIP won't be the top pick.
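
For what it's worth, a quick comparison is easy to run with the codecs the Python standard library already ships; they trade speed for ratio quite differently, which is exactly why the pick matters:

```python
import bz2
import gzip
import lzma
from pathlib import Path

def compare_codecs(path: str) -> None:
    """Print each codec's output size as a fraction of the original file."""
    data = Path(path).read_bytes()
    for name, compress in (("gzip", gzip.compress),
                           ("bz2", bz2.compress),
                           ("lzma", lzma.compress)):
        print(f"{name}: {len(compress(data)) / len(data):.1%} of original")
```

Third-party codecs like zstd (not in the stdlib) are often cited as a better speed/ratio trade-off for this kind of transparent compression.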

@kenahoo commented Jan 5, 2021

Hi all - was this ever implemented in DVC? If not, what's the best practice for checking in large-ish CSV files that are dependencies for pipelines?

@efiop (Contributor) commented Jan 5, 2021

@kenahoo It wasn't, but we are considering it in #829, which we are actively researching right now.

> If not, what's the best practice for checking in large-ish CSV files that are dependencies for pipelines?

Not sure I understand your question, could you elaborate, please?

@kenahoo commented Jan 5, 2021

> Not sure I understand your question, could you elaborate, please?

I have some large CSV files that are input data to some processes defined in dvc.yaml. They take a while to transfer back and forth between environments because they're stored raw, not compressed. I could store them in DVC as .csv.gz files instead and have the processes take care of uncompressing as needed, but that pushes the complexity into those processes. I'm wondering what other people do.
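
One detail that keeps that workaround cheaper than it sounds: many readers infer the codec from the file extension, so a stage can often consume the compressed file directly. For example, with pandas (the path below is just a placeholder):

```python
import pandas as pd

# pandas infers compression from the extension, so no explicit unzip step is needed
df = pd.read_csv("data/large_input.csv.gz")  # hypothetical path
```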

@shcheklein (Member)

@kenahoo what remote storage do you use? I wonder if it's possible to enable compression/decompression at the storage level for some remotes.

@jorgeorpinel (Contributor)

I think this could be a great optional/plugin (dvcx?) feature, even for custom/shared external cache dirs. Users with inefficient data formats could enable some config option, e.g. cache.compression gz, and let DVC do the rest. Maybe it could even apply to the local cache as well, if the effective cache.type is copy.

Having to uncompress cached files to compare content hash values (e.g. to check for changes during dvc status) does sound non-ideal, though I suppose compression formats already contain such/similar metadata (gzip and ZIP both store a CRC-32 of the uncompressed data, for example).
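
A sketch of one way around that concern, under the assumption that the digest is computed from the uncompressed content at add time: name the compressed blob after that digest, so status checks compare digests without ever decompressing. All names here are illustrative:

```python
import gzip
import hashlib
from pathlib import Path

def cache_compressed(src: Path, cache_dir: Path) -> str:
    """Store a gzipped blob named after the md5 of the *uncompressed* content."""
    data = src.read_bytes()
    digest = hashlib.md5(data).hexdigest()
    blob = cache_dir / f"{digest}.gz"
    if not blob.exists():                  # content already cached: nothing to do
        blob.write_bytes(gzip.compress(data))
    return digest                          # status compares digests, never unpacks
```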

@papiot commented Feb 8, 2021

> I have some large CSV files that are input data to some processes defined in dvc.yaml. They take a while to transfer back and forth... I could store them in DVC as .csv.gz files instead and have the processes take care of uncompressing as needed, but that pushes the complexity into those processes.

Currently running into exactly this issue. Will keep an eye on this feature.

@efiop (Contributor) commented Oct 8, 2021

Closing in favor of #829, as compression will be part of that mechanism (kind of like Git pack files).

@efiop closed this as completed Oct 8, 2021