introduce a readonly property for better parallelization #4979
Comments
@johnnychen94, great feature request. It should be done similar to the deps:

    - data/train/data.h5:
        read_only: True

I don't like that it complicates
We don't have any plans beyond #755 in the short term, which will only wait for locks rather than failing hard. We also don't have the capacity to work on it right now, so contributions are welcome. Regarding scheduling, we are introducing |
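To make the locking discussion concrete, here is a minimal sketch of generic readers-writer semantics: many stages may read a path at once, but a writer needs it exclusively. It is an illustration only, not DVC's actual `rwlock` implementation; the `RWLockTable` class, the `LockConflict` exception, the `train@0`/`train@1` stage names, and the `models/*.pt` paths are all hypothetical.

```python
class LockConflict(Exception):
    """Raised when a path cannot be locked in the requested mode."""


class RWLockTable:
    """Toy in-memory lock table (hypothetical): many readers or one writer per path."""

    def __init__(self):
        self.readers = {}  # path -> set of stage names currently reading it
        self.writers = {}  # path -> name of the single stage writing it

    def acquire(self, stage, deps=(), outs=()):
        # Validate every path first so a failed acquire changes nothing.
        for path in deps:
            if path in self.writers:
                raise LockConflict(f"{path} is being written by {self.writers[path]}")
        for path in outs:
            if path in self.writers or self.readers.get(path):
                raise LockConflict(f"{path} is already in use")
        for path in deps:
            self.readers.setdefault(path, set()).add(stage)
        for path in outs:
            self.writers[path] = stage

    def release(self, stage):
        # Drop the stage from every read lock, removing empty entries.
        for path in list(self.readers):
            self.readers[path].discard(stage)
            if not self.readers[path]:
                del self.readers[path]
        # Drop every write lock held by the stage.
        for path in [p for p, s in self.writers.items() if s == stage]:
            del self.writers[path]


table = RWLockTable()
# Two training stages share the same dependency: shared reads are allowed.
table.acquire("train@0", deps=["data/train/train.h5"], outs=["models/0.pt"])
table.acquire("train@1", deps=["data/train/train.h5"], outs=["models/1.pt"])
try:
    # A stage that wants to write the shared file must not run concurrently.
    table.acquire("prepare", outs=["data/train/train.h5"])
except LockConflict as exc:
    print(f"blocked: {exc}")
```

This version fails hard on a conflict. A waiting variant, in the spirit of what is described for #755, would retry `acquire` until the conflicting stage calls `release` instead of raising.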
Why isn't it the default behavior? I'm new to DVC, but I was under the impression that |
I thought the main reason was to prevent the DVC cache from being polluted, but then I realized the following anti-pattern actually works:

    stages:
      prepare:
        cmd: echo "generate new data" > data.txt
        outs:
          - data.txt
      train:
        cmd: chmod u+w data.txt && echo "accidentally modified the data" >> data.txt
        deps:
          - data.txt

By adding It also makes sense to me to (IIUC this is still compatible):
`dvc version`
|
I think this is considered undefined behavior from DVC's point of view. E.g., it won't allow you to specify
it seems to me that we can assume that as well (optimistic parallelization). I would prefer this vs creating a separate field in the The biggest question for this ticket would then be the scheduler itself. It's an interesting task. If you agree, let's close this in favor of #755? Maybe put a comment in that ticket with a summary of this discussion?
This looks like a separate feature request. Mind creating a separate ticket for it, so that we stay focused here? I think there are a few workarounds, but it might make sense to have a nice mechanism to pass them (like |
Training a bunch of models with different parameters is now much easier thanks to the parameterization feature, which is a great improvement because it deduplicates `dvc.yaml`. However, one thing still blocks us from training multiple models asynchronously.

Naturally, we want to schedule multiple jobs asynchronously on different devices without errors. But since `data/train/train.h5` will be locked, we can only run one job at a time. (We could work around this by creating multiple copies or symlinks, but that's not elegant.)

I'm wondering if it's possible to introduce a looser version of the read lock that users specify in `dvc.yaml`. When a user adds this property, they are explicitly saying, "Okay, I plan to use this in a read-only way, and I'll take responsibility for whatever bugs may occur due to my improper usage." `dvc` could then choose not to add an entry for it in `rwlock`, which enables concurrency.

I'm not sure if DVC has plans to natively support concurrent job scheduling. With #4976, it would be very promising if `dvc repro train@* -s` scheduled multiple jobs in parallel.

It would also be nice to support environment variable passing, although that is already doable by passing values from `params.yml` to the language's own utilities (e.g., `os.environ["CUDA_VISIBLE_DEVICES"] = config["gpu_device"]`).
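As a rough sketch of the environment-variable workaround, assuming the parameters have already been loaded into a dict: the `select_gpu` helper and the inline `config` dict are hypothetical stand-ins for values read from `params.yml`. One detail worth noting is that `os.environ` only accepts strings, so a numeric value such as `gpu_device: 0` needs an explicit conversion.

```python
import os


def select_gpu(config, key="gpu_device"):
    """Expose the configured GPU through CUDA_VISIBLE_DEVICES (hypothetical helper)."""
    # os.environ rejects non-string values, and a YAML entry like
    # `gpu_device: 0` is typically loaded as an int, so convert explicitly.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(config[key])
    return os.environ["CUDA_VISIBLE_DEVICES"]


# Hypothetical config, standing in for values loaded from params.yml.
config = {"gpu_device": 0}
select_gpu(config)
```

The variable should be set before the deep learning framework initializes CUDA, since it is generally read once at that point.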