
introduce a readonly property for better parallelization #4979

Closed
johnnychen94 opened this issue Nov 26, 2020 · 5 comments
Labels
feature request Requesting a new feature p3-nice-to-have It should be done this or next sprint

Comments

@johnnychen94 (Contributor) commented Nov 26, 2020

Training a bunch of models with different parameters has become much easier with the parametrization feature. This is a great improvement that deduplicates dvc.yaml, but one thing still blocks us from training multiple models asynchronously.

stages:
  train:
    foreach: ${models}
    do:
      cmd: julia --project=train train/main.jl data/train/data.h5 models/${item.name} ${item.config}
      deps:
      - data/train/data.h5
      - train
      outs:
      - models/${item.name}

Naturally, we want to schedule multiple jobs asynchronously on different devices without error, e.g.:

mkdir log
for i in 0 1 2 3; do
    CUDA_VISIBLE_DEVICES=$i dvc repro train@model_$i > log/model_$i.txt 2>&1 &
    sleep 2
done

but since data/train/data.h5 will be locked, we can only run one job at a time. (We could work around this by creating multiple copies/symlinks, but that's not elegant...)

I'm wondering if it's possible to introduce a looser version of the read lock, specified by users in dvc.yaml, e.g.:

 stages:
   train:
     foreach: ${models}
     do:
       cmd: julia --project=train train/main.jl data/train/data.h5 models/${item.name}
       deps:
-      - data/train/data.h5
+      - data/train/data.h5:readonly
       - train
       outs:
       - models/${item.name}

When a user adds this property, they are explicitly saying "okay, I plan to use this in a read-only way and I'll take responsibility for whatever bugs may occur due to my improper usage". DVC could then choose not to add an entry for such deps in rwlock, which enables concurrency.

I'm not sure whether DVC has plans to natively support concurrent job scheduling. With #4976, it would be very promising if dvc repro train@* -s scheduled multiple jobs in parallel.


It would also be nice to support environment variable passing, but that's already doable by passing params.yaml values to the language's own utilities (e.g., os.environ["CUDA_VISIBLE_DEVICES"] = config["gpu_device"]).
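As a concrete illustration of that workaround, a training script can lift the device id out of its loaded params and into the environment before any CUDA library initializes. A minimal sketch (the `gpu_device` key and the literal config dict are hypothetical stand-ins; in a real script the dict would come from parsing params.yaml, e.g. with PyYAML's `yaml.safe_load`):

```python
import os

# Stand-in for the parsed contents of params.yaml; in practice this
# would come from yaml.safe_load(open("params.yaml")).
config = {"gpu_device": 1}  # hypothetical key

# Must be set before any CUDA library initializes in this process.
os.environ["CUDA_VISIBLE_DEVICES"] = str(config["gpu_device"])
```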

@johnnychen94 johnnychen94 changed the title introduce a read property for better parallelization introduce a readonly property for better parallelization Nov 26, 2020
@pmrowla pmrowla added the A: templating Related to the templating feature label Nov 27, 2020
@skshetry (Member) commented:

@johnnychen94, great feature request. It could be done similarly to the outs section:

deps:
  - data/train/data.h5:
       read_only: True

I don't like that it complicates the deps section, though.

I'm not sure if DVC has plans to give native support for concurrent job scheduling

We don't have any plans beyond #755 in the short term, which will only wait for locks rather than failing hard. And we don't have the capacity to work on it right now, so contributions are welcome.

Regarding scheduling, we are introducing experiments, which have good support for scheduling (--queue) and parallelization (--jobs). It is an opt-in, experimental feature, but you can already use the same glob and foreach-group-run feature in it with #4976.
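For reference, the queue-then-parallelize workflow mentioned above can be sketched roughly as follows. This is a hedged sketch, not the exact CLI of the release discussed here: the experiments feature was still experimental at the time, so command names and flags may differ (check `dvc exp run --help` in your version):

```shell
# Queue a few experiments with different parameter overrides,
# then execute the whole queue in parallel.
dvc exp run --queue -S train.lr=0.01
dvc exp run --queue -S train.lr=0.1
dvc exp run --run-all --jobs 2
```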

@skshetry skshetry added feature request Requesting a new feature p3-nice-to-have It should be done this or next sprint and removed A: templating Related to the templating feature labels Nov 27, 2020
@skshetry skshetry removed their assignment Nov 27, 2020
@Xyand commented Nov 29, 2020

Why isn't it the default behavior?

I'm new to DVC, but I was under the impression that deps are inputs (read) to the stage while outs get modified (write). Why can't multiple dvc runs read-lock the same dep?

@johnnychen94 (Contributor, Author) commented Nov 30, 2020

I was under the impression that deps are inputs (read) to the stage while outs get modified (write).

I thought the main reason was to prevent the dvc cache from being polluted, but then I realized the following anti-pattern actually works:

stages:
  prepare:
    cmd: echo "generate new data" > data.txt
    outs:
    - data.txt
  train:
    cmd: chmod u+w data.txt && echo "accidentally modified the data" >> data.txt
    deps:
    - data.txt

By adding a readonly property, we could run eager md5 checks on the read-only input data and stop the repro when the md5 doesn't match.

It would also make sense to me to (IIUC this is still compatible with the above):

  1. make read-only inputs the default behavior
  2. run md5 checks on inputs after each stage, and throw a warning when a check fails
Output of `dvc version`:
DVC version: 1.10.1+abb78c.mod
---------------------------------
Platform: Python 3.8.3 on Linux-5.4.0-54-generic-x86_64-with-glibc2.10
Supports: All remotes
Cache types: symlink
Cache directory: nfs4 on storage:/home
Caches: local
Remotes: s3
Workspace directory: nfs4 on storage:/home
Repo: dvc, git

@shcheklein (Member) commented:

the following anti-pattern actually works:

I think this is considered undefined behavior from the DVC POV. E.g., it won't allow you to specify data.txt as an output (out) in the train stage, while effectively it is one.

Why isn't it the default behavior?

It seems to me that we can assume that as well (optimistic parallelization). I would prefer this over creating a separate field in the deps/outs sections.

The biggest question for this ticket would then be the scheduler itself. It's an interesting task. If you agree, let's close this in favor of #755? Maybe put a comment in that ticket with a summary of this discussion?

It can be nice to also support environment variable passing

That looks like a separate feature request. Mind creating a separate ticket for it, so that we stay focused here? I think there are a few workarounds, but it might make sense to have a nicer mechanism to pass them (like an env section in the stage DSL).

@johnnychen94 (Contributor, Author) commented:

I'm closing this in favor of #5007 and #755
