
introduce a readonly property for better parallelization #4979

Closed
johnnychen94 opened this issue Nov 26, 2020 · 5 comments
Labels
feature request Requesting a new feature p3-nice-to-have It should be done this or next sprint

Comments

@johnnychen94 (Contributor) commented Nov 26, 2020

Training a bunch of models with different parameters has become much easier with the parametrization feature. This is a great improvement that deduplicates dvc.yaml, but one thing still blocks us from training multiple models asynchronously.

stages:
  train:
    foreach: ${models}
    do:
      cmd: julia --project=train train/main.jl data/train/data.h5 models/${item.name} ${item.config}
      deps:
      - data/train/data.h5
      - train
      outs:
      - models/${item.name}

Naturally, we want to schedule multiple jobs asynchronously on different devices without error, e.g.:

mkdir log
for i in 0 1 2 3; do
    CUDA_VISIBLE_DEVICES=$i dvc repro train@model_$i > log/model_$i.txt 2>&1 &
    sleep 2
done

but since data/train/data.h5 will be locked, we can only run one job at a time. (We could work around this by creating multiple copies/symlinks, but that's not elegant...)

I'm wondering if it's possible to introduce a looser version of the read lock, specified by users in dvc.yaml, e.g.:

 stages:
   train:
     foreach: ${models}
     do:
       cmd: julia --project=train train/main.jl data/train/data.h5 models/${item.name}
       deps:
-      - data/train/data.h5
+      - data/train/data.h5:readonly
       - train
       outs:
       - models/${item.name}

When a user adds this property, they are explicitly saying "okay, I plan to use this in a read-only way and I'll take responsibility for whatever bugs may occur due to my improper usage". DVC could then choose not to add an entry for such deps in rwlock, which enables concurrency.

I'm not sure whether DVC has plans to natively support concurrent job scheduling. With #4976, it would be very promising if dvc repro train@* -s scheduled multiple jobs in parallel.


It would also be nice to support environment variable passing, but that's already doable by passing params.yaml values to the language's own utilities (e.g., os.environ["CUDA_VISIBLE_DEVICES"] = config["gpu_device"]).
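As a concrete illustration of that workaround, a training script can lift the device id out of its loaded params and into the environment before any CUDA library initializes. A minimal sketch (the `gpu_device` key and the literal config dict are hypothetical stand-ins; in a real script the dict would come from parsing params.yaml, e.g. with PyYAML's `yaml.safe_load`):

```python
import os

# Stand-in for the parsed contents of params.yaml; in practice this
# would come from yaml.safe_load(open("params.yaml")).
config = {"gpu_device": 1}  # hypothetical key

# Must be set before any CUDA library initializes in this process.
os.environ["CUDA_VISIBLE_DEVICES"] = str(config["gpu_device"])
```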

@johnnychen94 johnnychen94 changed the title introduce a read property for better parallelization introduce a readonly property for better parallelization Nov 26, 2020
@pmrowla pmrowla added the A: templating Related to the templating feature label Nov 27, 2020
@skshetry (Member) commented:

@johnnychen94, great feature request. It could be done similarly to the outs section:

deps:
  - data/train/data.h5:
       read_only: True

I don't like that it complicates the deps section, though.

I'm not sure if DVC has plans to give native support for concurrent job scheduling

We don't have any plans beyond #755 in the short term, which will only wait for locks rather than failing hard. And we don't have the capacity to work on it right now, so contributions are welcome.

Regarding scheduling, we are introducing experiments, which have good support for scheduling (--queue) and parallelization (--jobs). It is an opt-in, experimental feature, but you can already use the same glob and foreach-group-run feature in it with #4976.
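For reference, the queue-then-parallelize workflow mentioned above can be sketched roughly as follows. This is a hedged sketch, not the exact CLI of the release discussed here: the experiments feature was still experimental at the time, so command names and flags may differ (check `dvc exp run --help` in your version):

```shell
# Queue a few experiments with different parameter overrides,
# then execute the whole queue in parallel.
dvc exp run --queue -S train.lr=0.01
dvc exp run --queue -S train.lr=0.1
dvc exp run --run-all --jobs 2
```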

@skshetry skshetry added feature request Requesting a new feature p3-nice-to-have It should be done this or next sprint and removed A: templating Related to the templating feature labels Nov 27, 2020
@skshetry skshetry removed their assignment Nov 27, 2020
@Xyand commented Nov 29, 2020

Why isn't it the default behavior?

I'm new to DVC, but I was under the impression that deps are inputs (read) to the stage while outs get modified (write). Why can't multiple dvc runs read-lock the same dep?

@johnnychen94 (Contributor, Author) commented Nov 30, 2020

I was under the impression that deps are inputs (read) to the stage while outs get modified (write).

I thought the main reason was to prevent the dvc cache from being polluted, but then I realized the following anti-pattern actually works:

stages:
  prepare:
    cmd: echo "generate new data" > data.txt
    outs:
    - data.txt
  train:
    cmd: chmod u+w data.txt && echo "accidentally modified the data" >> data.txt
    deps:
    - data.txt

By adding a readonly property, we could run eager md5 checks on the read-only input data and stop the repro when the md5 doesn't match.

It would also make sense to me to (IIUC this is still compatible with the above):

  1. make read-only inputs the default behavior
  2. run md5 checks on inputs after each stage, and throw a warning when a check fails
Output of `dvc version`:
DVC version: 1.10.1+abb78c.mod
---------------------------------
Platform: Python 3.8.3 on Linux-5.4.0-54-generic-x86_64-with-glibc2.10
Supports: All remotes
Cache types: symlink
Cache directory: nfs4 on storage:/home
Caches: local
Remotes: s3
Workspace directory: nfs4 on storage:/home
Repo: dvc, git

@shcheklein (Member) commented:

the following anti-pattern actually works:

I think this is considered undefined behavior from the DVC POV. E.g., it won't allow you to specify data.txt as an output (out) in the train stage, while effectively it is one.

Why isn't it the default behavior?

It seems to me that we can assume that as well (optimistic parallelization). I would prefer this over creating a separate field in the deps/outs sections.

The biggest question for this ticket would then be the scheduler itself. It's an interesting task. If you agree, let's close this in favor of #755? Maybe put a comment in that ticket with a summary of this discussion?

It can be nice to also support environment variable passing

That looks like a separate feature request. Mind creating a separate ticket for it, so that we stay focused here? I think there are a few workarounds, but it might make sense to have a nicer mechanism to pass them (like an env section in the stage DSL).

@johnnychen94 (Contributor, Author) commented:

I'm closing this in favor of #5007 and #755
