dvc exp run --run-all: only runs two experiments and then hangs (similar to #8165) #10398

Closed
mghanava opened this issue Apr 23, 2024 · 3 comments

Comments

@mghanava

Bug Report

Description

When running dvc exp run --run-all, only two experiments run to completion, and then the next experiment never starts. I have tried twice, and each time exactly two consecutive experiments ran fully.

Reproduce

1. Add multiple experiments to the queue with dvc exp run --queue
2. Run dvc exp run --run-all

Expected

All experiments should run sequentially.

Environment information

Output of dvc doctor:

DVC version: 3.50.0 (pip)

Platform: Python 3.10.12 on Linux-5.15.146.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Subprojects:
dvc_data = 3.15.1
dvc_objects = 5.1.0
dvc_render = 1.0.2
dvc_task = 0.4.0
scmrepo = 3.3.1
Supports:
http (aiohttp = 3.9.1, aiohttp-retry = 2.8.3),
https (aiohttp = 3.9.1, aiohttp-retry = 2.8.3),
s3 (s3fs = 2024.3.1, boto3 = 1.28.64)
Config:
Global: /root/.config/dvc
System: /etc/xdg/dvc
Cache types: symlink
Cache directory: ext4 on /dev/sde
Caches: local
Remotes: s3
Workspace directory: ext4 on /dev/sde
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/508ec894ab4fcd5646b1ead942fa8b35

Additional Information (if any):

cat .dvc/tmp/exps/celery/dvc-exp-worker-1.out
/usr/local/lib/python3.10/dist-packages/celery/platforms.py:829: SecurityWarning: You're running the worker with superuser privileges: this is
absolutely not recommended!

Please specify a different user using the --uid option.

User information: uid=0 euid=0 gid=0 egid=0

warnings.warn(SecurityWarning(ROOT_DISCOURAGED.format(
[2024-04-18 15:51:26,850: WARNING/MainProcess] No hostname was supplied. Reverting to default 'localhost'

 -------------- dvc-exp-4ac376-1@localhost v5.4.0 (opalescent)
--- ***** -----
-- ******* ---- Linux-5.15.146.1-microsoft-standard-WSL2-x86_64-with-glibc2.35 2024-04-18 15:51:26
- *** --- * ---
- ** ---------- [config]
- ** ---------- .> app:         dvc-exp-local:0x7f2defb7c700
- ** ---------- .> transport:   filesystem://localhost//
- ** ---------- .> results:     file:///workspaces/platform/.dvc/tmp/exps/celery/result
- *** --- * --- .> concurrency: 1 (thread)
-- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
--- ***** -----
 -------------- [queues]
                .> celery           exchange=celery(direct) key=celery

[tasks]
. dvc.repo.experiments.queue.tasks.cleanup_exp
. dvc.repo.experiments.queue.tasks.collect_exp
. dvc.repo.experiments.queue.tasks.run_exp
. dvc.repo.experiments.queue.tasks.setup_exp
. dvc_task.proc.tasks.run

[2024-04-18 15:51:26,860: WARNING/MainProcess] /usr/local/lib/python3.10/dist-packages/celery/worker/consumer/consumer.py:508: CPendingDeprecationWarning: The broker_connection_retry configuration setting will no longer determine
whether broker connection retries are made during startup in Celery 6.0 and above.
If you wish to retain the existing behavior for retrying connections on startup,
you should set broker_connection_retry_on_startup to True.
warnings.warn(

[2024-04-18 15:51:26,860: WARNING/MainProcess] No hostname was supplied. Reverting to default 'localhost'
[2024-04-18 15:51:26,860: INFO/MainProcess] Connected to filesystem://localhost//
[2024-04-18 15:51:26,861: INFO/MainProcess] dvc-exp-4ac376-1@localhost ready.
[2024-04-18 15:51:26,862: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[26ac9773-ffd9-4b1e-abbe-dcf5fbbf5741] received
[2024-04-18 15:53:14,714: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[26ac9773-ffd9-4b1e-abbe-dcf5fbbf5741] succeeded in 107.85451585499686s: None
[2024-04-18 15:53:16,067: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[ad2d173c-7348-4013-a417-9d946d3d19f0] received
[2024-04-18 15:55:13,671: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[ad2d173c-7348-4013-a417-9d946d3d19f0] succeeded in 117.60698523299652s: None
[2024-04-18 15:55:15,405: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[df43ce08-360d-44c2-ac3f-b0270fa693e4] received
[2024-04-18 15:56:50,571: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[df43ce08-360d-44c2-ac3f-b0270fa693e4] succeeded in 95.1688057010033s: None
[2024-04-18 15:56:50,715: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[22e9611d-9bb7-4aeb-8227-1cd8c2e4fcf9] received
[2024-04-18 15:58:48,605: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[22e9611d-9bb7-4aeb-8227-1cd8c2e4fcf9] succeeded in 117.89407574100187s: None
[2024-04-18 15:59:02,593: INFO/MainProcess] monitor: shutting down due to empty queue.
[2024-04-18 15:59:02,966: WARNING/MainProcess] Got shutdown from remote
[2024-04-18 15:59:02,968: INFO/MainProcess] cleaning up FSApp broker.
[2024-04-18 15:59:02,981: INFO/MainProcess] done
/usr/local/lib/python3.10/dist-packages/celery/platforms.py:829: SecurityWarning: You're running the worker with superuser privileges: this is
absolutely not recommended!

Please specify a different user using the --uid option.

User information: uid=0 euid=0 gid=0 egid=0

warnings.warn(SecurityWarning(ROOT_DISCOURAGED.format(
[2024-04-20 14:01:21,608: WARNING/MainProcess] No hostname was supplied. Reverting to default 'localhost'

 -------------- dvc-exp-4ac376-1@localhost v5.4.0 (opalescent)
--- ***** -----
-- ******* ---- Linux-5.15.146.1-microsoft-standard-WSL2-x86_64-with-glibc2.35 2024-04-20 14:01:21
- *** --- * ---
- ** ---------- [config]
- ** ---------- .> app:         dvc-exp-local:0x7ff4e3aac6d0
- ** ---------- .> transport:   filesystem://localhost//
- ** ---------- .> results:     file:///workspaces/platform/.dvc/tmp/exps/celery/result
- *** --- * --- .> concurrency: 1 (thread)
-- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
--- ***** -----
 -------------- [queues]
                .> celery           exchange=celery(direct) key=celery

[tasks]
. dvc.repo.experiments.queue.tasks.cleanup_exp
. dvc.repo.experiments.queue.tasks.collect_exp
. dvc.repo.experiments.queue.tasks.run_exp
. dvc.repo.experiments.queue.tasks.setup_exp
. dvc_task.proc.tasks.run

[2024-04-20 14:01:21,621: WARNING/MainProcess] /usr/local/lib/python3.10/dist-packages/celery/worker/consumer/consumer.py:508: CPendingDeprecationWarning: The broker_connection_retry configuration setting will no longer determine
whether broker connection retries are made during startup in Celery 6.0 and above.
If you wish to retain the existing behavior for retrying connections on startup,
you should set broker_connection_retry_on_startup to True.
warnings.warn(

[2024-04-20 14:01:21,621: WARNING/MainProcess] No hostname was supplied. Reverting to default 'localhost'
[2024-04-20 14:01:21,621: INFO/MainProcess] Connected to filesystem://localhost//
[2024-04-20 14:01:21,623: INFO/MainProcess] dvc-exp-4ac376-1@localhost ready.
[2024-04-20 14:01:21,624: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[0285ad2f-cc55-487b-bb72-a22cddbb560d] received
[2024-04-22 01:08:08,508: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[0285ad2f-cc55-487b-bb72-a22cddbb560d] succeeded in 3015.664237446s: None
[2024-04-22 01:08:10,100: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[1741062f-5998-4ed3-8a2b-f0726876eeee] received
[2024-04-22 01:24:01,119: CRITICAL/MainProcess] Unrecoverable error: JSONDecodeError('Expecting value: line 1 column 1 (char 0)')
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/celery/worker/worker.py", line 202, in start
self.blueprint.start(self)
File "/usr/local/lib/python3.10/dist-packages/celery/bootsteps.py", line 116, in start
step.start(parent)
File "/usr/local/lib/python3.10/dist-packages/celery/bootsteps.py", line 365, in start
return self.obj.start()
File "/usr/local/lib/python3.10/dist-packages/celery/worker/consumer/consumer.py", line 340, in start
blueprint.start(self)
File "/usr/local/lib/python3.10/dist-packages/celery/bootsteps.py", line 116, in start
step.start(parent)
File "/usr/local/lib/python3.10/dist-packages/celery/worker/consumer/consumer.py", line 746, in start
c.loop(*c.loop_args())
File "/usr/local/lib/python3.10/dist-packages/celery/worker/loops.py", line 130, in synloop
connection.drain_events(timeout=2.0)
File "/usr/local/lib/python3.10/dist-packages/kombu/connection.py", line 341, in drain_events
return self.transport.drain_events(self.connection, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/kombu/transport/virtual/base.py", line 997, in drain_events
get(self._deliver, timeout=timeout)
File "/usr/local/lib/python3.10/dist-packages/kombu/utils/scheduling.py", line 55, in get
return self.fun(resource, callback, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/kombu/transport/virtual/base.py", line 1035, in _drain_channel
return channel.drain_events(callback=callback, timeout=timeout)
File "/usr/local/lib/python3.10/dist-packages/kombu/transport/virtual/base.py", line 754, in drain_events
return self._poll(self.cycle, callback, timeout=timeout)
File "/usr/local/lib/python3.10/dist-packages/kombu/transport/virtual/base.py", line 414, in _poll
return cycle.get(callback)
File "/usr/local/lib/python3.10/dist-packages/kombu/utils/scheduling.py", line 55, in get
return self.fun(resource, callback, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/kombu/transport/virtual/base.py", line 417, in _get_and_deliver
message = self._get(queue)
File "/usr/local/lib/python3.10/dist-packages/kombu/transport/filesystem.py", line 261, in _get
return loads(bytes_to_str(payload))
File "/usr/local/lib/python3.10/dist-packages/kombu/utils/json.py", line 93, in loads
return _loads(s, object_hook=object_hook)
File "/usr/lib/python3.10/json/__init__.py", line 359, in loads
return cls(**kw).decode(s)
File "/usr/lib/python3.10/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.10/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
[2024-04-22 01:53:58,451: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[1741062f-5998-4ed3-8a2b-f0726876eeee] succeeded in 2748.4771515409993s: None
[2024-04-22 01:53:58,452: INFO/MainProcess] cleaning up FSApp broker.
[2024-04-22 01:53:58,566: INFO/MainProcess] done
/usr/local/lib/python3.10/dist-packages/celery/platforms.py:829: SecurityWarning: You're running the worker with superuser privileges: this is
absolutely not recommended!

Please specify a different user using the --uid option.

User information: uid=0 euid=0 gid=0 egid=0

warnings.warn(SecurityWarning(ROOT_DISCOURAGED.format(
[2024-04-22 15:19:31,011: WARNING/MainProcess] No hostname was supplied. Reverting to default 'localhost'

 -------------- dvc-exp-4ac376-1@localhost v5.4.0 (opalescent)
--- ***** -----
-- ******* ---- Linux-5.15.146.1-microsoft-standard-WSL2-x86_64-with-glibc2.35 2024-04-22 15:19:31
- *** --- * ---
- ** ---------- [config]
- ** ---------- .> app:         dvc-exp-local:0x7f8037ae4700
- ** ---------- .> transport:   filesystem://localhost//
- ** ---------- .> results:     file:///workspaces/platform/.dvc/tmp/exps/celery/result
- *** --- * --- .> concurrency: 1 (thread)
-- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
--- ***** -----
 -------------- [queues]
                .> celery           exchange=celery(direct) key=celery

[tasks]
. dvc.repo.experiments.queue.tasks.cleanup_exp
. dvc.repo.experiments.queue.tasks.collect_exp
. dvc.repo.experiments.queue.tasks.run_exp
. dvc.repo.experiments.queue.tasks.setup_exp
. dvc_task.proc.tasks.run

[2024-04-22 15:19:31,019: WARNING/MainProcess] /usr/local/lib/python3.10/dist-packages/celery/worker/consumer/consumer.py:508: CPendingDeprecationWarning: The broker_connection_retry configuration setting will no longer determine
whether broker connection retries are made during startup in Celery 6.0 and above.
If you wish to retain the existing behavior for retrying connections on startup,
you should set broker_connection_retry_on_startup to True.
warnings.warn(

[2024-04-22 15:19:31,019: WARNING/MainProcess] No hostname was supplied. Reverting to default 'localhost'
[2024-04-22 15:19:31,020: INFO/MainProcess] Connected to filesystem://localhost//
[2024-04-22 15:19:31,021: INFO/MainProcess] dvc-exp-4ac376-1@localhost ready.
[2024-04-22 15:19:31,021: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[3e30a522-de58-49dc-8e93-f029aa9a015d] received
[2024-04-22 15:52:03,066: WARNING/MainProcess] No hostname was supplied. Reverting to default 'localhost'
[2024-04-22 16:10:49,673: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[3e30a522-de58-49dc-8e93-f029aa9a015d] succeeded in 3078.8052399700027s: None
[2024-04-22 16:10:50,675: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[99127667-18e3-4d3e-bcf4-1dd02e78e426] received
[2024-04-22 16:48:49,787: CRITICAL/MainProcess] Unrecoverable error: JSONDecodeError('Expecting value: line 1 column 1 (char 0)')
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/celery/worker/worker.py", line 202, in start
self.blueprint.start(self)
File "/usr/local/lib/python3.10/dist-packages/celery/bootsteps.py", line 116, in start
step.start(parent)
File "/usr/local/lib/python3.10/dist-packages/celery/bootsteps.py", line 365, in start
return self.obj.start()
File "/usr/local/lib/python3.10/dist-packages/celery/worker/consumer/consumer.py", line 340, in start
blueprint.start(self)
File "/usr/local/lib/python3.10/dist-packages/celery/bootsteps.py", line 116, in start
step.start(parent)
File "/usr/local/lib/python3.10/dist-packages/celery/worker/consumer/consumer.py", line 746, in start
c.loop(*c.loop_args())
File "/usr/local/lib/python3.10/dist-packages/celery/worker/loops.py", line 130, in synloop
connection.drain_events(timeout=2.0)
File "/usr/local/lib/python3.10/dist-packages/kombu/connection.py", line 341, in drain_events
return self.transport.drain_events(self.connection, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/kombu/transport/virtual/base.py", line 997, in drain_events
get(self._deliver, timeout=timeout)
File "/usr/local/lib/python3.10/dist-packages/kombu/utils/scheduling.py", line 55, in get
return self.fun(resource, callback, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/kombu/transport/virtual/base.py", line 1035, in _drain_channel
return channel.drain_events(callback=callback, timeout=timeout)
File "/usr/local/lib/python3.10/dist-packages/kombu/transport/virtual/base.py", line 754, in drain_events
return self._poll(self.cycle, callback, timeout=timeout)
File "/usr/local/lib/python3.10/dist-packages/kombu/transport/virtual/base.py", line 414, in _poll
return cycle.get(callback)
File "/usr/local/lib/python3.10/dist-packages/kombu/utils/scheduling.py", line 55, in get
return self.fun(resource, callback, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/kombu/transport/virtual/base.py", line 417, in _get_and_deliver
message = self._get(queue)
File "/usr/local/lib/python3.10/dist-packages/kombu/transport/filesystem.py", line 261, in _get
return loads(bytes_to_str(payload))
File "/usr/local/lib/python3.10/dist-packages/kombu/utils/json.py", line 93, in loads
return _loads(s, object_hook=object_hook)
File "/usr/lib/python3.10/json/__init__.py", line 359, in loads
return cls(**kw).decode(s)
File "/usr/lib/python3.10/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.10/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
[2024-04-22 17:06:12,589: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[99127667-18e3-4d3e-bcf4-1dd02e78e426] succeeded in 3322.133445313004s: None
[2024-04-22 17:06:12,590: INFO/MainProcess] cleaning up FSApp broker.
[2024-04-22 17:06:12,762: INFO/MainProcess] done
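
A worker log like the one above is easier to read once each run_exp task is paired with its "received" and "succeeded" lines and the CRITICAL entries are pulled out. The following is a minimal sketch of that, assuming the line format shown in this log; the function name summarize_worker_log and the regular expressions are illustrative and not part of DVC or Celery.

```python
import re

# Log line shape assumed from the worker output in this issue:
# [2024-04-22 16:10:50,675: INFO/MainProcess] <message>
LINE_RE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d+): "
    r"(?P<level>[A-Z]+)/MainProcess\] (?P<msg>.*)$"
)
# Message shape for task events: Task <dotted.name>[<uuid>] received|succeeded ...
TASK_RE = re.compile(
    r"^Task (?P<name>[\w.]+)\[(?P<id>[0-9a-f-]+)\] (?P<event>received|succeeded)"
)


def summarize_worker_log(text):
    """Return (tasks, critical) parsed from a Celery worker log.

    tasks maps task id -> {"received": timestamp, "succeeded": timestamp or None};
    critical is a list of (timestamp, message) for CRITICAL lines.
    """
    tasks = {}
    critical = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # traceback lines, banners, and warnings without a header
        if m["level"] == "CRITICAL":
            critical.append((m["ts"], m["msg"]))
            continue
        t = TASK_RE.match(m["msg"])
        if t:
            entry = tasks.setdefault(t["id"], {"received": None, "succeeded": None})
            entry[t["event"]] = m["ts"]
    return tasks, critical


# Three lines copied from the log above.
sample = """\
[2024-04-22 16:10:50,675: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[99127667-18e3-4d3e-bcf4-1dd02e78e426] received
[2024-04-22 16:48:49,787: CRITICAL/MainProcess] Unrecoverable error: JSONDecodeError('Expecting value: line 1 column 1 (char 0)')
[2024-04-22 17:06:12,589: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[99127667-18e3-4d3e-bcf4-1dd02e78e426] succeeded in 3322.133445313004s: None
"""

tasks, critical = summarize_worker_log(sample)
for task_id, times in tasks.items():
    print(task_id, times)
for ts, msg in critical:
    print("CRITICAL at", ts, "->", msg)
```

A task whose "succeeded" value stays None after the last log line would be one that was received but never reported as finished.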

@dberenbaum
Collaborator

I see there are errors in the logs you provided. Do the experiments complete successfully? Do you consistently get that JSONDecodeError? I see a similar error in #9358.
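
For anyone trying to narrow down the JSONDecodeError: the traceback in the report ends in kombu/transport/filesystem.py calling loads() on a message payload, and "Expecting value: line 1 column 1 (char 0)" is the message Python's json module gives for an empty or non-JSON string. Below is a minimal sketch for checking whether a folder contains files that fail to decode. The helper name find_undecodable_messages and the demo file names are hypothetical, and the folder the broker actually reads from is not stated in this thread, so the path is left as a parameter.

```python
import json
import tempfile
from pathlib import Path


def find_undecodable_messages(folder):
    """Return (path, error text) for every regular file in folder that is not valid JSON."""
    bad = []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        try:
            json.loads(path.read_text(encoding="utf-8", errors="replace"))
        except json.JSONDecodeError as exc:
            bad.append((path, str(exc)))
    return bad


# Demo on a throwaway directory: one valid JSON file and one empty file.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "ok.msg").write_text('{"body": "payload"}', encoding="utf-8")
    Path(tmp, "empty.msg").write_text("", encoding="utf-8")
    results = [(p.name, err) for p, err in find_undecodable_messages(tmp)]

print(results)
```

In the demo, the empty file is the one reported, with the same error text that appears in the worker log. Whether an empty or partially written message file was the cause in this report is not confirmed here.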

@mghanava
Author

Yes, two of the experiments ran completely. The error only showed up in the cat .dvc/tmp/exps/celery/dvc-exp-worker-1.out logs. However, the issue is now resolved, and I was able to run the rest of the experiments without any problem (I had 15 experiments; the first 4 hit this issue, but from experiment 5 onward they ran through to the end).

@dberenbaum
Collaborator

Thanks. I'll close this one then, but if you're able to consistently reproduce it again, we can reopen.
