Incorporate file sizes in dvc file #3256

Closed
dmpetrov opened this issue Jan 30, 2020 · 8 comments
Labels
discussion requires active participation to reach a conclusion enhancement Enhances DVC research

Comments

@dmpetrov
Member

If we add file sizes to DVC-files (recorded when we calculate the checksum, so no extra reads are needed), we can show this info in dvc diff, dvc list, and other commands with no I/O or computational overhead.

Related to #2982
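
A minimal sketch of the "no extra reads" point above, assuming a plain chunked md5 pass; `md5_and_size` and `demo` are hypothetical names for illustration, not DVC's actual implementation. The byte count is accumulated in the same loop that feeds the hash, so the size comes for free:

```python
import hashlib
import os
import tempfile


def md5_and_size(path, chunk_size=1024 * 1024):
    """Return (md5 hex digest, size in bytes) from a single pass over the file."""
    digest = hashlib.md5()
    size = 0
    with open(path, "rb") as fobj:
        while True:
            chunk = fobj.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)  # feed the hash
            size += len(chunk)    # count bytes in the same pass
    return digest.hexdigest(), size


def demo():
    """Write a small temporary file and hash it (illustration only)."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"hello world")
    try:
        return md5_and_size(tmp.name)
    finally:
        os.remove(tmp.name)
```

Calling `demo()` returns the digest and the size together, without a separate `os.stat` call or a second read.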

@dmpetrov dmpetrov added the discussion requires active participation to reach a conclusion label Jan 30, 2020
@shcheklein
Member

I like the idea! On the other hand, it might complicate merges; will diffs become bigger? Also, are there any other fields that could potentially be useful (names? modes? type?).

Just a thought to consider: does it make sense to revisit the discussion around taking md5 file hashes out? Potentially all this meta information can go into the same place.

@ghost

ghost commented Jan 30, 2020

Just a thought to consider: does it make sense to revisit the discussion around taking md5 file hashes out? Potentially all this meta information can go into the same place.

Yes 🙂

@dmpetrov
Member Author

dmpetrov commented Feb 2, 2020

does it make sense to revisit the discussion around taking md5 file hashes out? Potentially all this meta information can go into the same place.

💯

@efiop
Contributor

efiop commented Feb 3, 2020

For the record, this might be solved in #1871.

@pmrowla
Contributor

pmrowla commented Sep 7, 2020

From #2982

We should store file size in addition to hashes. That would let us show file sizes in dvc diff.

  • store file size together with checksums (it might be a separate issue)
  • display file size in diff (like metrics diff). It might be disabled by default.
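
To make the second bullet concrete, here is a rough sketch of a size-aware diff, assuming sizes are already available as plain `{path: size}` mappings for the two revisions. `format_size` and `size_diff` are hypothetical helpers and do not reflect how dvc diff is implemented:

```python
def format_size(num_bytes):
    """Render a byte count in human-readable form using binary units."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if abs(size) < 1024 or unit == "GiB":
            break
        size /= 1024
    return f"{int(size)} B" if unit == "B" else f"{size:.1f} {unit}"


def size_diff(old, new):
    """Compare two {path: size} mappings.

    Returns (path, old_size, new_size, delta) rows for paths whose size
    changed; a missing path is reported as None and counted as 0 bytes.
    """
    rows = []
    for path in sorted(set(old) | set(new)):
        before = old.get(path)
        after = new.get(path)
        if before == after:
            continue
        rows.append((path, before, after, (after or 0) - (before or 0)))
    return rows
```

Since both inputs come from stored metadata, a report like this needs no access to the data files themselves.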

@pmrowla pmrowla added enhancement Enhances DVC product: VSCode Integration with VSCode extension labels Sep 7, 2020
@efiop efiop added the research label Sep 29, 2020
@efiop efiop self-assigned this Oct 15, 2020
efiop added a commit to efiop/dvc that referenced this issue Nov 4, 2020
efiop added a commit that referenced this issue Nov 4, 2020
* dvc: add size for deps/outs

Related to #3256

* dvc: add nfiles for deps/outs

* dvc: put size/nfiles into the hash_info
efiop added a commit to efiop/dvc that referenced this issue Nov 7, 2020
Currently we are converting dir_info to/from lists all the time.
The reason is that dir_info is stored as list of dicts in *.dir files,
but that makes it hard to work with. In addition to that, we will likely
be changing .dir file format in the near future iterative#829, so we need to
abstract away dir_info into something that we won't care how it will be
stored on disk.

Related iterative#3256
Related iterative#4847
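
As an illustration of the abstraction described in this commit message, the sketch below wraps the list-of-dicts form in an object keyed by relative path. The `DirInfo` class here is hypothetical, and the `md5`/`relpath` key names are assumed from the list-of-dicts description; it is not the code from the commit:

```python
class DirInfo:
    """Hypothetical in-memory view of a directory listing, keyed by relative path."""

    def __init__(self, entries=None):
        self._entries = dict(entries or {})

    @classmethod
    def from_list(cls, items):
        """Build from the on-disk form: a list of {"md5": ..., "relpath": ...} dicts."""
        return cls({item["relpath"]: item["md5"] for item in items})

    def to_list(self):
        """Serialize back to a list of dicts, sorted by relpath for stable output."""
        return [
            {"md5": md5, "relpath": relpath}
            for relpath, md5 in sorted(self._entries.items())
        ]

    def get(self, relpath):
        """Look up the hash for one file without scanning a list."""
        return self._entries.get(relpath)

    def __len__(self):
        return len(self._entries)
```

Callers work with the object, and only `from_list`/`to_list` would need to change if the on-disk format changes.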
efiop added a commit that referenced this issue Nov 7, 2020
@MetalBlueberry
Contributor

I've noticed that the size of the whole tracked dir is recorded in the .dvc file, but there is no data for the individual files inside the tracked directory. I would like to have this information so I can display a list of files and their sizes inside the directory without downloading any file.

Is it possible to add the size to the JSON file generated to track the contents of the directory?

@efiop
Contributor

efiop commented Nov 18, 2020

Hi @MetalBlueberry !

Great question! Indeed, we are thinking about adding size to the .dir cache file, but adding it right now would make older dvc versions register it as cache corruption. We also would not be able to self-validate .dir files without filtering them first (their md5 shouldn't depend on size fields); see #4841.

We will also add support for these to dvc diff/list/status.
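
The self-validation concern above can be sketched as follows, assuming entries are dicts with `md5` and `relpath` keys plus an optional, hypothetical `size` field. `dir_digest` is an illustrative helper, and its JSON serialization is an assumption that is not claimed to match DVC's actual .dir hashing:

```python
import hashlib
import json

# Only these keys take part in the digest; anything else (e.g. "size") is ignored.
HASHED_KEYS = ("md5", "relpath")


def dir_digest(entries):
    """Return an md5 over the entries after filtering out non-hashed fields."""
    filtered = sorted(
        ({key: entry[key] for key in HASHED_KEYS} for entry in entries),
        key=lambda entry: entry["relpath"],
    )
    payload = json.dumps(filtered, sort_keys=True).encode("utf-8")
    return hashlib.md5(payload).hexdigest()
```

With this kind of filtering, a listing that carries sizes and one that does not produce the same digest, which is the property needed for extra fields not to look like corruption.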

@dberenbaum dberenbaum removed the product: VSCode Integration with VSCode extension label May 18, 2021
@efiop
Contributor

efiop commented Dec 8, 2023

Most likely this will be replaced by #8884, but if .dir files stay in some shape or form, we'll redesign them from scratch. So closing this.

@efiop efiop closed this as completed Dec 8, 2023
6 participants