remote storage auto-compression #1239
Hi @AratorField! Great idea! We are thinking about ways to optimize cache size (e.g. #829), but just haven't gotten to working on it yet. Let's leave this issue open as a reminder.
Ivan and I were just talking about this while reviewing the current https://dvc.org/doc/get-started/example-versioning. For a new document we're writing, we will reuse this image dataset, but as an uncompressed directory with 2 versions: the first one with the contents of data.zip only (1000 images), and the 2nd one with data.zip + new-labels.zip (2000 images).

Besides the fact that compression probably doesn't help much with these images anyway (though it would with tabular data, e.g. CSV or JSON), this enables the DVC feature of file deduplication between directory versions, so that when you download the new revision you only get the delta.

Anyway, this is all to say that automatic compression would be a very natural feature to have. Thoughts, @iterative/engineering?
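For context, a rough sketch of that workflow with plain DVC commands (file names and image counts are only illustrative):

```shell
# version 1: track the extracted images as a plain directory
unzip data.zip -d data
dvc add data
git add data.dvc .gitignore
git commit -m "dataset v1 (1000 images)"
dvc push

# version 2: add the new labels and re-track the same directory
unzip new-labels.zip -d data
dvc add data
git commit -am "dataset v2 (2000 images)"
dvc push   # only the new files are uploaded, thanks to per-file dedup
```

Tracking data.zip itself instead would make each version a single opaque blob, so any change means re-uploading the whole archive.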
Sorry @jorgeorpinel, I didn't notice it before.
Isn't that already happening for directories?
Yes, that's my point: deduplication already happens, but not if you track/version compressed datasets. (My comment above has a long preamble, sorry.) The main idea/suggestion is in the last paragraph above.
I would like to try to rephrase this issue to make sure I understand it. Given that large directories can take a long time to upload, they should automatically be compressed, without the user's intervention? So the process would be something like:
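Roughly, as I imagine it, using tar/gzip only as stand-ins for whatever compression DVC would actually pick:

```shell
# on push: compress before uploading
tar -czf data.tar.gz data/      # pack the directory once
# <upload data.tar.gz to the remote instead of the raw files>

# on pull: decompress after downloading
# <download data.tar.gz from the remote>
tar -xzf data.tar.gz            # restore the original directory
```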
It turns out that modifying a zip file without re-compressing it is documented in the man page of zip.
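For reference, if I read the man page correctly, the Info-ZIP tools update an archive in place and only recompress the entries that are added or replaced (paths below are just examples):

```shell
zip data.zip new-labels/0001.jpg   # add or replace one entry; the others are copied as-is
zip -d data.zip old/0001.jpg       # delete an entry without touching the rest
unzip -l data.zip                  # list the archive contents to verify
```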
DVC actually uploads files individually so it would probably not compress entire directories, but your 2 previous steps seem good.
Again, since files are checked and downloaded/uploaded individually, this shouldn't be such a big deal. But thanks for the reference on updating zipped content. I'm not sure what compression algorithm would be best, but TBH I have the impression ZIP won't be the top pick.
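To illustrate what per-file compression at the remote layer could look like (purely a hypothetical sketch, not how DVC works today; the cache path is an assumption):

```shell
# compress each cache object into a sidecar .gz before uploading it;
# hashes would still refer to the uncompressed content
for obj in .dvc/cache/*/*; do
  gzip -c "$obj" > "$obj.gz"
  # <upload "$obj.gz" to the remote in place of "$obj">
done

# on download, the reverse: fetch "$obj.gz" and restore the original bytes
# gunzip -c "$obj.gz" > "$obj"
```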
Hi all - was this ever implemented in DVC? If not, what's the best practice for checking in large-ish CSV files that are dependencies for pipelines?
I have some large CSV files that are input data to some processes defined in my DVC pipeline.
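Not an official answer, but a common workaround while this is open is to keep large CSVs gzipped yourself and track the compressed copies; most CSV readers (pandas included) can open .gz files directly. File names here are just examples:

```shell
gzip -k data/train.csv       # -k keeps the original and writes data/train.csv.gz
dvc add data/train.csv.gz    # track (and later push) only the compressed file
# pipeline stages then list data/train.csv.gz as a dependency and read it as-is
```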
@kenahoo, what remote storage do you use? I wonder if it's possible to enable compression/decompression at the storage level for some remotes?
I think that this could be a great optional/plugin (dvcx?) feature, even for custom/shared external cache dirs. Users with inefficient data formats could enable some config option for it. Having to uncompress cached files to compare content hash values (e.g. to check for changes) could be a concern, though.
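A rough illustration of that verification cost (the cache path and file names are made up): if cached objects were stored gzipped while hashes still referred to the uncompressed content, every change check would need a decompression pass:

```shell
# hash of the workspace file (uncompressed content)
md5sum data/train.csv

# hash of the same content from a hypothetically gzipped cache object:
# the blob has to be streamed through gunzip before it can be hashed
gunzip -c .dvc/cache/ab/cdef0123456789.gz | md5sum
```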
Currently running into this exact issue. Will keep an eye on this feature.
Closing in favor of #829, as it will be part of the mechanism (kinda like git pack files).
Are you considering adding an auto-compression feature for data sent to remote storage?
It could save a lot of storage space and speed up sending files to the remote.