Dropping .dir #8884

dberenbaum · 2023-01-25T19:22:02Z

In #8849, we stopped serializing the directory info in the resulting .dvc/dvc.lock files for cloud versioned remotes. Can we do the same everywhere?

This would help with a bunch of existing issues:

diff gets confused when the object is not in cache #7661 and dvc diff: unexpected output in Github Actions #8875: dvc diff fails to compare refs unless the data associated with those refs has been pulled locally. dvc data status also reports an unknown status when data hasn't been pulled. By having all the files listed in the .dvc/dvc.lock file, it would always be possible to get the granular file info of any commit.
cloud versioning: dvc pull fails after version aware import-url for non version aware s3 remote #8872: dvc pull on an import-url target is now supposed to be able to pull the data directly from the source without having to push a copy to the remote, but it doesn't work for directories because only the high-level directory info is saved to the .dvc file.
Mechanism to update a dataset w/o downloading it first #4657: To modify an existing directory, the whole directory needs to be pulled. Having the granular file info in the .dvc file means that users could delete a file by searching for it in the .dvc file and deleting that entry. This still isn't great UX, but it should be easy to make dvc add/remove work at a granular level by only modifying part of the .dvc file.
merge-driver: failed merges don't return conflict markers #8638: Users have to install a special merge driver because of .dir entries. Even then, a merge conflict becomes hard to troubleshoot because the conflict will not show both .dir entries, and even if it did, there's no easy way to combine them. With granular file info instead of the .dir entries, no merge driver is needed, and merge conflicts could be resolved by editing the file info in the .dvc files.

Automatically pushing and pulling the .dir files from the remote could also solve a lot of these problems, but it seems like a worse UX. It's less transparent, harder for users to manage, and fails when users don't have access to the remote or forgot to push something.

How much do we really need the reference to the .dir file? If necessary, could we serialize that reference somewhere that's not git-tracked, like in a shadow .dvc/tmp/mydataset.dvcfile?

The text was updated successfully, but these errors were encountered:

skshetry · 2023-01-26T03:48:02Z

I would say that we wait how cloud versioning goes before making any decisions here.

skshetry · 2023-01-26T03:48:39Z

Also, yaml parsing is many times slower than json parsing.

daavoo · 2023-01-26T12:14:01Z

Also, yaml parsing is many times slower than json parsing.

Any reason for lock files to be in YAML?

skshetry · 2023-01-26T14:24:06Z

We wanted the lock file to be human-readable too (dvc.yaml, it's other half is well in yaml), so that it's easy to manage, fix conflicts, and be more transparent to user (as it's more readable and easy to parse by humans).

dberenbaum · 2023-01-27T13:56:56Z

Possibly related: https://github.com/iterative/studio/issues/4968. Studio checks the size limit of outputs but not at a granular level. Would it be simpler for them to do this granularly if they had the individual files in dvc.lock?

dberenbaum · 2023-02-28T18:21:59Z

I would say that we wait how cloud versioning goes before making any decisions here.

What would be the harm in at least adding a flag to opt in?

efiop · 2023-03-09T16:33:38Z

I agree that we should just drop it. Especially for imports. This makes much more sense than creating .dir files which poison our cache and are complicated to manage/load.

It seems like it worked out pretty nicely for cloud versioning, so I think we are ready to switch.

dberenbaum · 2023-06-27T16:05:04Z

We have had some reports that dvc.lock parsing is slow for people with many parametrized stages, so that gives me some pause about whether this would work well for large datasets. Let's at least test with some sizeable dataset before we go too deep.

skshetry · 2023-06-27T16:15:24Z

I have tested cloud versioned CelebA dataset, it took 30 seconds to be parsed. It was a .dvc file though.

dberenbaum · 2023-06-27T16:17:03Z

@skshetry Do you know how long it takes currently with .dir and json parsing?

dberenbaum · 2023-06-27T16:19:35Z

Also, how hard would it be to use a more efficient yaml parser?

efiop · 2023-06-27T23:27:36Z

@dberenbaum yaml is pretty bad overall. But json should be significantly better though. This is to the topic of dvc.lock files and maybe either migrating to them (data/artifacts section in dvc.yaml) or maybe using json in data.dvc (json and yaml are easy to distinguish, so shouldn't be a problem, but maybe it is an opportunity to create something new).

We might be able to shave down unneeded features from yaml parser or use something that is doing that for us out of the box, but yaml is just not quite suitable for that many entries.

skshetry · 2023-06-28T06:04:59Z

We use yaml parser written in C. You can check if you have it installed using:

python  -c "import ruamel.yaml; print(ruamel.yaml.__with_libyaml__)"

Here are some benchmarks on yaml/json parsing based on celeba:

Parsing `.dir` file

>>> %timeit _ = json.loads(json_text)
78.5 ms ± 191 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit _ = orjson.loads(json_text)
54.7 ms ± 590 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Note that the actual parsing for .dir file is about ~1s due to iterative/dvc-data#350.

Parsing `.dvc` file

Read-only

>>> yaml_text = Path("dataset.dvc").read_text()
>>> %time _ = loads_yaml(yaml_text)
CPU times: user 30.8 s, sys: 103 ms, total: 30.9 s
Wall time: 31 s

Reading .dvc file for update (round-trip)

>>> %time _ = loads_yaml(yaml_text, "rt")
 CPU times: user 3min 30s, sys: 486 ms, total: 3min 31s
Wall time: 3min 31s

Modifying .dvc file with no changes (also involves round-trip)

>>> %%time
    with modify_yaml("dataset.dvc") as d:
       pass
 
CPU times: user 5min 24s, sys: 848 ms, total: 5min 25s
Wall time: 5min 26s

Schema validation

>>> %time _ = COMPILED_SINGLE_STAGE_SCHEMA(d)
CPU times: user 5.52 s, sys: 3.22 ms, total: 5.52 s
Wall time: 5.53 s

Note that we have three times more data in .dvc file than .dir. The size of .dir file is 21MB, and has two item per entry (relpath and md5). The .dvc file spans 1,418,224 lines and is 50MB in size and has more metadata fields (size) and cloud.

- relpath: list_landmarks_align_celeba.csv
  md5: e12a3125253832211b718c810b5fe92f
  size: 9932092
  cloud:
    cloud_versioned:
      etag: 91f987c43b7f37a06be88cd3e7e6dae1
      version_id: dFK7BuMO0mEDHdcqpEcyKJIbiTo3rb3p

{
    "md5": "e12a3125253832211b718c810b5fe92f",
    "relpath": "list_landmarks_align_celeba.csv"
},

YAML is supposed to be a human-readable format, so it is going to be slower, and is not meant to be abused in this way.

dberenbaum added discussion requires active participation to reach a conclusion A: data-sync Related to dvc get/fetch/import/pull/push A: data-management Related to dvc add/checkout/commit/move/remove labels Jan 25, 2023

dberenbaum mentioned this issue Feb 3, 2023

cloud versioning: dvc pull fails after version aware import-url for non version aware s3 remote #8872

Closed

efiop mentioned this issue Mar 9, 2023

fetch: introduce --tree-only #9140

Closed

dberenbaum added the p2-medium Medium priority, should be done, but less important label May 20, 2023

pmrowla mentioned this issue Jun 27, 2023

Adding size information to artifact information of .dir files #9668

Closed

efiop mentioned this issue Dec 8, 2023

Incorporate file sizes in dvc file #3256

Closed

dberenbaum closed this as not planned Won't fix, can't repro, duplicate, stale Mar 14, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Dropping .dir #8884

Dropping .dir #8884

dberenbaum commented Jan 25, 2023 •

edited

Loading

skshetry commented Jan 26, 2023

skshetry commented Jan 26, 2023

daavoo commented Jan 26, 2023

skshetry commented Jan 26, 2023

dberenbaum commented Jan 27, 2023

dberenbaum commented Feb 28, 2023

efiop commented Mar 9, 2023 •

edited

Loading

dberenbaum commented Jun 27, 2023

skshetry commented Jun 27, 2023

dberenbaum commented Jun 27, 2023

dberenbaum commented Jun 27, 2023

efiop commented Jun 27, 2023 •

edited

Loading

skshetry commented Jun 28, 2023 •

edited

Loading

Dropping .dir #8884

Dropping .dir #8884

Comments

dberenbaum commented Jan 25, 2023 • edited Loading

skshetry commented Jan 26, 2023

skshetry commented Jan 26, 2023

daavoo commented Jan 26, 2023

skshetry commented Jan 26, 2023

dberenbaum commented Jan 27, 2023

dberenbaum commented Feb 28, 2023

efiop commented Mar 9, 2023 • edited Loading

dberenbaum commented Jun 27, 2023

skshetry commented Jun 27, 2023

dberenbaum commented Jun 27, 2023

dberenbaum commented Jun 27, 2023

efiop commented Jun 27, 2023 • edited Loading

skshetry commented Jun 28, 2023 • edited Loading

Parsing .dir file

Parsing .dvc file

dberenbaum commented Jan 25, 2023 •

edited

Loading

efiop commented Mar 9, 2023 •

edited

Loading

efiop commented Jun 27, 2023 •

edited

Loading

skshetry commented Jun 28, 2023 •

edited

Loading

Parsing `.dir` file

Parsing `.dvc` file