Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dvc: release locks when running a command #2584

Merged
merged 1 commit into from
Dec 13, 2019
Merged

Conversation

efiop
Copy link
Contributor

@efiop efiop commented Oct 8, 2019

Currently only one instance of dvc could run in a particular repo
because of the .dvc/lock. That gets furstrating when you are running
some long-running command and blocked from doing anything else.
This PR solved that issue for dvc run by releasing the repo lock
while running the command. To protect against conflicts, we now use
rwlock, which keeps the list of readers and writers for particular
paths. The process looks like this:

  1. aquire read locks for deps;
  2. aquire write locks for outs;
  3. release repo lock;
  4. run cmd;
  5. aquire repo lock;
  6. release write locks for outs;
  7. release read locks for outs;

We also take rwlocks for operations such as import, add and checkout
to ensure that we won't hit the same files that are being used by
another dvc run in the background.

We don't aquire rwlocks for the stage file itself, as we don't really
care about anyone overwriting a stage file, as there is a very little
difference here between parallel runs and sequential runs that overwrite
the same stage file.

Next steps might be to release repo lock on import(when downloading
stuff) and when computing checksums or copying files.

Related to #755

TODO:

  • improve errors. Maybe save cmd to the rwlock info too to provide a friendlier errror?
  • use path_isin to check for cases where we depend on a file in a directory that is output to another stage
  • put asserts to protect from accidental reentrance
  • consider versioning for rwlock file or at least not silently overwriting it if we get something we don't expect dvc: release locks when running a command #2584 (comment)
  • wait indefinitely until we get a repo lock back after running the command No urgent need, user can dvc run --no-exec + dvc commit if it fails to aquire the lock back.
  • tests

@efiop efiop changed the title [WIP] dvc: release locks when running a command dvc: release locks when running a command Oct 8, 2019
@efiop efiop force-pushed the 755 branch 3 times, most recently from f10dca3 to 9c523d7 Compare October 8, 2019 12:41
@efiop efiop changed the title dvc: release locks when running a command [WIP] dvc: release locks when running a command Oct 8, 2019
@efiop efiop force-pushed the 755 branch 8 times, most recently from 2685ac5 to fef90bc Compare October 9, 2019 21:25
@efiop efiop changed the title [WIP] dvc: release locks when running a command dvc: release locks when running a command Oct 10, 2019
@efiop efiop changed the title dvc: release locks when running a command [WIP] dvc: release locks when running a command Oct 10, 2019
@efiop efiop force-pushed the 755 branch 2 times, most recently from 01275c1 to 752ffe2 Compare October 10, 2019 14:05
@efiop efiop changed the title [WIP] dvc: release locks when running a command dvc: release locks when running a command Oct 10, 2019
@efiop efiop force-pushed the 755 branch 2 times, most recently from ce14d90 to 681a7c9 Compare October 14, 2019 21:07
@efiop efiop requested review from Suor, pared and a user October 15, 2019 11:35
Copy link
Contributor

@Suor Suor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Graph shouldn't be recollected while repo lock is off, ideally we should assert that, this might capture some nasty errors now or in the future.

Looks like we use @locked_stage on quite small ops. This arises several questions:

  • how do you decide what to wrap and what to not? I.e. why Stage.commit() is not locked?
  • how will it affect the performance?
  • what will happen if some locked op calls another locked op? Doesn't look like we provide reentrance, will we get AlreadyLocked error or smth?

Overall this looks fragile. The presupposition I guess should be that the dvc process might be in two states:

  • collecting/updating stage files (Repo lock should be on)
  • running command/updating stage outputs (Stage locked, repo unlocked)

So the structure of code and maybe some asserts should check that:

  • we are not collecting stages, while repo lock is off
  • some lock is always on, there shouldn't be inbetweens
  • we only update (read?) stage outs when stage lock is on

Not sure how this works with deps.

dvc/stage/lock.py Outdated Show resolved Hide resolved
dvc/repo/imp_url.py Outdated Show resolved Hide resolved
dvc/stage/__init__.py Outdated Show resolved Hide resolved
dvc/stage/lock.py Outdated Show resolved Hide resolved
tests/unit/test_stage.py Show resolved Hide resolved
@Suor
Copy link
Contributor

Suor commented Oct 16, 2019

This probably looks vague because it's nowhere explained what stage lock means. And there are lots of possibilities:

  • stage file should not be changed/read from outside
  • outs should not be changed/read from outside
  • deps should not be changed/read from outside

Or any combination of the above. If say stage lock means exclusive access to outs then it explains why we lock stage and its dep stages (their outs are our deps) on run. This, however, means that Stage.commit() should lock the stage because it accesses its outs.

This should be stated very clearly. Until questions like "may I dump the stage while I don't have repo lock?" will arise.

@efiop efiop force-pushed the 755 branch 2 times, most recently from b7d9e4b to 783ac18 Compare October 17, 2019 12:51
Copy link
Contributor

@pared pared left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Shouldn't Stage._compute_md5 be wrapped in
@rwlocked(read=["deps","outs"])?

@efiop
Copy link
Contributor Author

efiop commented Dec 9, 2019

@pared Good catch! You are right, it should even be that for the changed as a whole, same as with status, which is already wrapped. Fixing right now. Thanks!

@pared
Copy link
Contributor

pared commented Dec 9, 2019

One thing about the reproduce and update:
Seems like those two commands also can modify the outputs, however they rely on _run somewhere down the line. There is a risk that we will receive LockError inside reproduce, between changed and run calls, though we should be safe here, as changed check just reads the outputs.

@efiop
Copy link
Contributor Author

efiop commented Dec 9, 2019

@pared Good point! So looks like we do need some reentrance after all. Taking a look...

Copy link
Contributor

@Suor Suor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still think we should use inside-out schema, e.g. {'read': {str: [INFO]}, 'write': {str: INFO}}. Code samples below become simpler and more efficient. If you worry about paths being duplicated than they won't be since one path might only be locked either for read or write.

dvc/path_info.py Show resolved Hide resolved
dvc/rwlock.py Outdated Show resolved Hide resolved
dvc/rwlock.py Outdated Show resolved Hide resolved
dvc/rwlock.py Outdated Show resolved Hide resolved
@efiop efiop force-pushed the 755 branch 3 times, most recently from d3b02c6 to e59cbf9 Compare December 11, 2019 05:50
@efiop
Copy link
Contributor Author

efiop commented Dec 11, 2019

Adjusted the schema and made rwlock reentrant, so @rwlocked is now placed on public methods of Stage, which is much nicer than what it was before.

@efiop efiop requested a review from Suor December 11, 2019 17:37
dvc/path_info.py Show resolved Hide resolved
dvc/rwlock.py Outdated Show resolved Hide resolved
dvc/rwlock.py Outdated Show resolved Hide resolved
dvc/rwlock.py Outdated Show resolved Hide resolved
dvc/rwlock.py Outdated Show resolved Hide resolved
dvc/rwlock.py Show resolved Hide resolved
dvc/rwlock.py Show resolved Hide resolved
@efiop efiop force-pushed the 755 branch 4 times, most recently from cd0b668 to 8588360 Compare December 13, 2019 04:57
@efiop efiop requested a review from Suor December 13, 2019 05:47
Copy link
Contributor

@Suor Suor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So basically is done.

Two things on tests:

  • tmpdir is deprecated, it's tmp_path now
  • need to test that reentrance won't work for different pids

Currently only one instance of dvc could run in a particular repo
because of the `.dvc/lock`. That gets furstrating when you are running
some long-running command and blocked from doing anything else.
This PR solved that issue for `dvc run` by releasing the repo lock
while running the command. To protect against conflicts, we now use
rwlock, which keeps the list of readers and writers for particular
paths. The process looks like this:

1) aquire read locks for deps;
2) aquire write locks for outs;
3) release repo lock;
4) run cmd;
5) aquire repo lock;
6) release write locks for outs;
7) release read locks for outs;

We also take rwlocks for operations such as `import`, `add` and `checkout`
to ensure that we won't hit the same files that are being used by
another `dvc run` in the background.

We don't aquire rwlocks for the stage file itself, as we don't really
care about anyone overwriting a stage file, as there is a very little
difference here between parallel runs and sequential runs that overwrite
the same stage file.

Next steps might be to release repo lock on `import`(when dowloading
stuff) and when computing checksums or copying files.

Fixes iterative#755
@efiop
Copy link
Contributor Author

efiop commented Dec 13, 2019

tmpdir is deprecated, it's tmp_path now

Fixed.

need to test that reentrance won't work for different pids

Not sure I follow. It sounds like a regular rwlock operation to me. The access is info-based (pid + cmd), so pretty much any test like https://github.com/iterative/dvc/pull/2584/files#diff-15338413b9f377c5906d1b1b3498f98cR32 tests that. Could you elaborate?

@efiop efiop requested a review from Suor December 13, 2019 08:42
@Suor
Copy link
Contributor

Suor commented Dec 13, 2019

The access is info-based (pid + cmd),

I see, you use different commands, this should be resolved then.

Copy link
Contributor

@Suor Suor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

End on the epoch 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants