queue: concurrent workers fail #8763

Closed
mattseddon opened this issue Jan 5, 2023 · 3 comments · Fixed by #9089
Labels
A: task-queue Related to task queue. p1-important Important, aka current backlog of things to do product: VSCode Integration with VSCode extension

Comments

@mattseddon
Member

mattseddon commented Jan 5, 2023

Bug Report

Description

Processing queued experiments with 3 concurrent workers while the VS Code extension is installed causes at least one of the workers to fail to reproduce its experiment.

Error is caused by:

Traceback (most recent call last):
  File "/example-get-started/.venv/lib/python3.10/site-packages/dulwich/file.py", line 150, in __init__
    fd = os.open(
FileExistsError: [Errno 17] File exists: b'/example-get-started/.git/packed-refs.lock'
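
The traceback shows an `os.open` call failing because `packed-refs.lock` already exists. As a minimal sketch of that kind of lock-file collision (not dulwich's actual code; `acquire_lock` and `demo` are hypothetical names, and the flags used here are an assumption about how such a lock is typically created):

```python
import os
import tempfile


def acquire_lock(path: str) -> int:
    """Create `path + '.lock'` exclusively and return its file descriptor.

    Raises FileExistsError if another writer already holds the lock file.
    """
    lock_path = path + ".lock"
    # O_CREAT combined with O_EXCL makes creation fail if the file exists.
    return os.open(lock_path, os.O_RDWR | os.O_CREAT | os.O_EXCL, 0o644)


def demo() -> str:
    """Simulate two writers contending for the same lock file."""
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "packed-refs")
        first = acquire_lock(target)  # first writer takes the lock
        try:
            acquire_lock(target)  # second writer collides with the first
        except FileExistsError:
            result = "second writer blocked"
        else:
            result = "no collision"
        finally:
            os.close(first)
            os.remove(target + ".lock")
        return result


if __name__ == "__main__":
    print(demo())
```

Any two processes that touch the same ref file at the same moment can hit this, which is consistent with the failure only showing up when several workers (or a worker and `exp show`) run together.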

After running the experiments, the queue status looks something like this:

❯ dvc queue status
Task     Name        Created    Status
0312db5  herby-cham  03:39 PM   Failed
b7e61b4  rutty-pear  03:38 PM   Success
ab6ba0b  zinky-repp  03:38 PM   Success
56384bf  huger-yard  03:38 PM   Success
ae45100  matte-main  03:38 PM   Success
71a0afb  conic-boll  03:39 PM   Success

Worker status: 0 active, 0 idle

Neither of these commands offers any help:

❯ dvc queue logs 0312db5 
ERROR: No output logs found for experiment '0312db5'
❯ dvc queue logs herby-cham
ERROR: No output logs found for experiment 'herby-cham'

Reproduce

  1. Open example-get-started inside of VS Code with the DVC extension installed.
  2. Queue several experiments.
  3. Run exp show with watch.
  4. Start the queue with 3 workers.
  5. A worker will likely fail to run one of the experiments.

Expected

Workers should concurrently process the experiments.

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.38.1 (pip)
---------------------------------
Platform: Python 3.10.6 on macOS-13.1-arm64-arm-64bit
Subprojects:
        dvc_data = 0.28.4
        dvc_objects = 0.14.0
        dvc_render = 0.0.15
        dvc_task = 0.1.8
        dvclive = 1.3.2
        scmrepo = 0.1.4
Supports:
        http (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.3, aiohttp-retry = 2.8.3)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: https
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc, git

Additional Information (if any):

Logs from worker with failed experiment
 -------------- dvc-exp-8acc46-2@localhost v5.2.7 (dawn-chorus)
--- ***** ----- 
-- ******* ---- macOS-13.1-arm64-arm-64bit 2023-01-05 15:33:19
- *** --- * --- 
- ** ---------- [config]
- ** ---------- .> app:         dvc-exp-local:0x1071cabc0
- ** ---------- .> transport:   filesystem://localhost//
- ** ---------- .> results:     file:///Users/mattseddon/projects/example-get-started/.dvc/tmp/exps/celery/result
- *** --- * --- .> concurrency: 1 (thread)
-- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
--- ***** ----- 
 -------------- [queues]
                .> celery           exchange=celery(direct) key=celery
                

[tasks]
  . dvc.repo.experiments.queue.tasks.cleanup_exp
  . dvc.repo.experiments.queue.tasks.collect_exp
  . dvc.repo.experiments.queue.tasks.run_exp
  . dvc.repo.experiments.queue.tasks.setup_exp
  . dvc_task.proc.tasks.run

[2023-01-05 15:33:19,216: WARNING/MainProcess] No hostname was supplied. Reverting to default 'localhost'
[2023-01-05 15:33:19,217: INFO/MainProcess] Connected to filesystem://localhost//
[2023-01-05 15:33:19,219: INFO/MainProcess] dvc-exp-8acc46-2@localhost ready.
[2023-01-05 15:33:19,220: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[e090c4a4-3150-4b40-888f-63be51cd58dc] received
[2023-01-05 15:33:19,330: ERROR/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[e090c4a4-3150-4b40-888f-63be51cd58dc] raised unexpected: FileLocked(b'/Users/mattseddon/projects/example-get-started/.git/packed-refs', b'/Users/mattseddon/projects/example-get-started/.git/packed-refs.lock')
Traceback (most recent call last):
  File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.10/site-packages/dulwich/file.py", line 150, in __init__
    fd = os.open(
FileExistsError: [Errno 17] File exists: b'/Users/mattseddon/projects/example-get-started/.git/packed-refs.lock'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.10/site-packages/celery/app/trace.py", line 451, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.10/site-packages/celery/app/trace.py", line 734, in __protected_call__
    return self.run(*args, **kwargs)
  File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.10/site-packages/dvc/repo/experiments/queue/tasks.py", line 109, in run_exp
    executor = setup_exp.s(entry_dict)()
  File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.10/site-packages/celery/canvas.py", line 168, in __call__
    return self.type(*args, **kwargs)
  File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.10/site-packages/celery/app/trace.py", line 735, in __protected_call__
    return orig(self, *args, **kwargs)
  File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.10/site-packages/celery/app/task.py", line 392, in __call__
    return self.run(*args, **kwargs)
  File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.10/site-packages/dvc/repo/experiments/queue/tasks.py", line 33, in setup_exp
    executor = BaseStashQueue.init_executor(
  File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.10/site-packages/dvc/repo/experiments/queue/base.py", line 598, in init_executor
    executor.init_git(
  File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.10/site-packages/funcy/decorators.py", line 45, in wrapper
    return deco(call, *dargs, **dkwargs)
  File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.10/site-packages/funcy/flow.py", line 127, in retry
    return call()
  File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.10/site-packages/funcy/decorators.py", line 66, in __call__
    return self._func(*self._args, **self._kwargs)
  File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.10/site-packages/dvc/repo/experiments/executor/local.py", line 105, in init_git
    with get_exp_rwlock(
  File "/opt/homebrew/Caskroom/miniforge/base/lib/python3.10/contextlib.py", line 142, in __exit__
    next(self.gen)
  File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.10/site-packages/dvc/repo/experiments/executor/base.py", line 748, in set_temp_refs
    scm.remove_ref(ref)
  File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.10/site-packages/scmrepo/git/__init__.py", line 289, in _backend_func
    result = func(*args, **kwargs)
  File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.10/site-packages/scmrepo/git/backend/dulwich/__init__.py", line 485, in remove_ref
    if not self.repo.refs.remove_if_equals(name_b, old_ref_b):
  File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.10/site-packages/dulwich/refs.py", line 972, in remove_if_equals
    self._remove_packed_ref(name)
  File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.10/site-packages/dulwich/refs.py", line 759, in _remove_packed_ref
    f = GitFile(filename, "wb")
  File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.10/site-packages/dulwich/file.py", line 92, in GitFile
    return _GitFile(filename, mode, bufsize, mask)
  File "/Users/mattseddon/projects/example-get-started/.venv/lib/python3.10/site-packages/dulwich/file.py", line 156, in __init__
    raise FileLocked(filename, self._lockfilename) from exc
dulwich.file.FileLocked: (b'/Users/mattseddon/projects/example-get-started/.git/packed-refs', b'/Users/mattseddon/projects/example-get-started/.git/packed-refs.lock')
[2023-01-05 15:33:20,092: INFO/MainProcess] monitor: watching celery worker 'dvc-exp-8acc46-2@localhost'
[2023-01-05 15:33:20,233: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[5f36addd-5653-4f9b-8d2a-7dc36807cbd9] received
[2023-01-05 15:33:37,212: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[5f36addd-5653-4f9b-8d2a-7dc36807cbd9] succeeded in 16.977437333000125s: None
[2023-01-05 15:33:48,166: INFO/MainProcess] monitor: shutting down due to empty queue.
[2023-01-05 15:33:48,168: INFO/MainProcess] monitor: done
[2023-01-05 15:33:48,375: WARNING/MainProcess] Got shutdown from remote

The extension has recently started getting commit information using git log. I thought this might be causing the issue, but I can reproduce the failure outside of the VS Code context by running exp show.
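
Since the failure looks like a short-lived collision on a lock file, one generic way to tolerate it is to retry the locked write with a small backoff. The sketch below is only an illustration of that idea under stated assumptions; it is not taken from DVC, scmrepo, or dulwich, and it is not a description of the fix referenced by this issue. `write_with_lock` and its parameters are hypothetical.

```python
import os
import tempfile
import time
from typing import Callable


def write_with_lock(
    path: str,
    data: bytes,
    retries: int = 5,
    delay: float = 0.05,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Write `data` to `path` via `path + '.lock'`; return the attempt count.

    Hypothetical helper: retries when the lock file already exists and
    re-raises FileExistsError once `retries` attempts are exhausted.
    """
    lock_path = path + ".lock"
    for attempt in range(1, retries + 1):
        try:
            fd = os.open(lock_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
        except FileExistsError:
            if attempt == retries:
                raise
            sleep(delay * attempt)  # linear backoff before the next attempt
            continue
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(data)
            # Renaming the lock file over the target publishes the new
            # content and releases the lock in one step.
            os.replace(lock_path, path)
        except BaseException:
            # Do not leave a stale lock behind if the write fails.
            if os.path.exists(lock_path):
                os.remove(lock_path)
            raise
        return attempt
    raise ValueError("retries must be at least 1")
```

The `sleep` parameter is injectable so the retry path can be exercised without real waiting, for example by passing a function that removes a pre-existing lock file.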

@mattseddon mattseddon added product: VSCode Integration with VSCode extension A: task-queue Related to task queue. labels Jan 5, 2023
@dberenbaum dberenbaum added the p1-important Important, aka current backlog of things to do label Jan 6, 2023
@karajan1001
Contributor

I tried to implement a pygit2 backend for push_refspec, which is used during the initialization process. But when I try to push with the refspec refs/foo/bar:refs/foo/bar or refs/foo/bar:refs/remotes/origin/foo/bar, it always fails with the error "local push doesn't (yet) support pushing to non-bare repos".

@pmrowla
Contributor

pmrowla commented Jan 17, 2023

@karajan1001 libgit2 doesn't support pushing to non-bare (regular) local repos: https://github.com/libgit2/libgit2/blob/1327dbcf2a4273a8ba6fd978db5f0882530af94d/src/libgit2/transports/local.c#L383

You could still implement it, since it would still support pushing to other types of git remotes, but you would have to make sure we catch the error for the non-bare local case, re-raise it as NotImplementedError, and fall back to another git backend for that case.

(but this means we can't use pygit2 push for running experiments locally)
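
The catch, re-raise, and fall back pattern described in this comment could look roughly like the sketch below. Every name here (`NonBareLocalPushError`, `_raw_push`, `pygit2_like_push`, `fallback_push`, `push_refspec`) is a hypothetical stand-in for illustration; none of it is scmrepo's or pygit2's real API, and the URL check is a deliberately crude assumption about what counts as a local remote.

```python
from typing import Callable, Sequence


class NonBareLocalPushError(Exception):
    """Hypothetical stand-in for the error raised on non-bare local pushes."""


def _raw_push(url: str, refspec: str) -> str:
    # Hypothetical primary backend: anything that is not an obvious
    # network URL is treated as a local, non-bare repository.
    if not url.startswith(("http://", "https://", "ssh://")):
        raise NonBareLocalPushError(
            "local push doesn't (yet) support pushing to non-bare repos"
        )
    return f"pygit2-like backend pushed {refspec} to {url}"


def pygit2_like_push(url: str, refspec: str) -> str:
    """Translate the backend-specific error into NotImplementedError."""
    try:
        return _raw_push(url, refspec)
    except NonBareLocalPushError as exc:
        raise NotImplementedError("push to non-bare local repo") from exc


def fallback_push(url: str, refspec: str) -> str:
    """Hypothetical secondary backend that accepts any URL."""
    return f"fallback backend pushed {refspec} to {url}"


def push_refspec(
    url: str,
    refspec: str,
    backends: Sequence[Callable[[str, str], str]] = (pygit2_like_push, fallback_push),
) -> str:
    """Try each backend in order, skipping those that raise NotImplementedError."""
    for backend in backends:
        try:
            return backend(url, refspec)
        except NotImplementedError:
            continue
    raise NotImplementedError("no backend supports this push")
```

With this shape, a remote URL is handled by the first backend, while a local path falls through to the second, which matches the point that the first backend cannot be used for running experiments locally.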
