Experiments table not updating while experiments are running #4528

Closed
aschuh-hf opened this issue Aug 18, 2023 · 26 comments · Fixed by #4529 or #4579
Labels: A: experiments (Area: experiments table webview and everything related), priority-p1 (Regular product backlog)

Comments

@aschuh-hf

aschuh-hf commented Aug 18, 2023

The metrics in the Experiments table are not following the results of multiple parallel experiment runs even though dvc exp show does show the different updated metrics. The Experiments table mainly contains the original base commit metrics.

This issue may be because it seems that the extension does not call dvc exp show. Instead, I see lots of dvc stage list commands in the "DVC" output window.

[version: 1.0.43, 2023-08-18T00:34:42.565Z, pid: 36047] > dvc stage list - INITIALIZED
[version: 1.0.43, 2023-08-18T00:34:43.435Z, pid: 36047] > dvc stage list - COMPLETED (912ms)
[version: 1.0.43, 2023-08-18T00:34:43.473Z, pid: 36050] > dvc stage list - INITIALIZED
[version: 1.0.43, 2023-08-18T00:34:43.487Z, pid: 35221] > dvc data status --granular --unchanged --json - COMPLETED (141986ms)
[version: 1.0.43, 2023-08-18T00:34:47.993Z, pid: 36050] > dvc stage list - COMPLETED (4556ms)
[version: 1.0.43, 2023-08-18T00:34:48.026Z, pid: 36054] > dvc stage list - INITIALIZED
[version: 1.0.43, 2023-08-18T00:34:49.929Z, pid: 36054] > dvc stage list - COMPLETED (1935ms)
[version: 1.0.43, 2023-08-18T00:34:49.981Z, pid: 36073] > dvc stage list - INITIALIZED
[version: 1.0.43, 2023-08-18T00:34:50.667Z, pid: 36073] > dvc stage list - COMPLETED (737ms)
[version: 1.0.43, 2023-08-18T00:34:50.715Z, pid: 36076] > dvc stage list - INITIALIZED
[version: 1.0.43, 2023-08-18T00:34:51.405Z, pid: 36076] > dvc stage list - COMPLETED (736ms)
[version: 1.0.43, 2023-08-18T00:34:51.452Z, pid: 36079] > dvc stage list - INITIALIZED
[version: 1.0.43, 2023-08-18T00:34:52.142Z, pid: 36079] > dvc stage list - COMPLETED (735ms)
[version: 1.0.43, 2023-08-18T00:34:52.196Z, pid: 36082] > dvc stage list - INITIALIZED
[version: 1.0.43, 2023-08-18T00:34:52.857Z, pid: 36082] > dvc stage list - COMPLETED (713ms)
[version: 1.0.43, 2023-08-18T00:34:52.900Z, pid: 36085] > dvc stage list - INITIALIZED
[version: 1.0.43, 2023-08-18T00:34:53.456Z, pid: 36085] > dvc stage list - COMPLETED (597ms)
[version: 1.0.43, 2023-08-18T00:34:53.496Z, pid: 36091] > dvc stage list - INITIALIZED
[version: 1.0.43, 2023-08-18T00:34:54.151Z, pid: 36091] > dvc stage list - COMPLETED (694ms)
[version: 1.0.43, 2023-08-18T00:34:54.199Z, pid: 36108] > dvc stage list - INITIALIZED
[version: 1.0.43, 2023-08-18T00:34:54.906Z, pid: 36108] > dvc stage list - COMPLETED (754ms)
[version: 1.0.43, 2023-08-18T00:34:54.953Z, pid: 36111] > dvc stage list - INITIALIZED
[version: 1.0.43, 2023-08-18T00:34:55.649Z, pid: 36111] > dvc stage list - COMPLETED (742ms)
[version: 1.0.43, 2023-08-18T00:34:55.696Z, pid: 36114] > dvc stage list - INITIALIZED
[version: 1.0.43, 2023-08-18T00:34:56.421Z, pid: 36114] > dvc stage list - COMPLETED (770ms)
[version: 1.0.43, 2023-08-18T00:34:56.466Z, pid: 36117] > dvc stage list - INITIALIZED
[version: 1.0.43, 2023-08-18T00:34:57.171Z, pid: 36117] > dvc stage list - COMPLETED (749ms)
[version: 1.0.43, 2023-08-18T00:34:57.223Z, pid: 36120] > dvc stage list - INITIALIZED
[version: 1.0.43, 2023-08-18T00:34:57.986Z, pid: 36120] > dvc stage list - COMPLETED (814ms)
[version: 1.0.43, 2023-08-18T00:34:58.026Z, pid: 36123] > dvc stage list - INITIALIZED
[version: 1.0.43, 2023-08-18T00:34:58.730Z, pid: 36123] > dvc stage list - COMPLETED (744ms)
[version: 1.0.43, 2023-08-18T00:34:58.768Z, pid: 36129] > dvc stage list - INITIALIZED
[version: 1.0.43, 2023-08-18T00:34:59.465Z, pid: 36129] > dvc stage list - COMPLETED (735ms)
[version: 1.0.43, 2023-08-18T00:34:59.520Z, pid: 36146] > dvc stage list - INITIALIZED
[version: 1.0.43, 2023-08-18T00:35:00.265Z, pid: 36146] > dvc stage list - COMPLETED (799ms)

There are plenty more of these; some failed with a dvc.lock validation error, e.g.,

[version: 1.0.43, 2023-08-18T00:35:35.341Z, pid: 36431] > dvc stage list - INITIALIZED
[version: 1.0.43, 2023-08-18T00:35:36.371Z, pid: 36431] > dvc stage list - FAILED with code 255 (1063ms)
'./dvc.lock' validation failed: 13 errors.

extra keys not allowed, in create_index_table, line 2, column 3
    1 create_index_table:
    2   cmd: python -m scripts.create_index_table
...

The latest commands shown at the moment are:

[version: 1.0.43, 2023-08-18T00:38:57.834Z, pid: 37297] > dvc stage list - INITIALIZED
[version: 1.0.43, 2023-08-18T00:38:58.575Z, pid: 37297] > dvc stage list - COMPLETED (772ms)
[version: 1.0.43, 2023-08-18T00:38:58.606Z, pid: 37300] > dvc stage list - INITIALIZED
[version: 1.0.43, 2023-08-18T00:39:01.712Z, pid: 37300] > dvc stage list - COMPLETED (3136ms)
[version: 1.0.43, 2023-08-18T07:59:14.480Z, pid: 52404] > dvc data status --granular --unchanged --json - INITIALIZED
[version: 1.0.43, 2023-08-18T07:59:14.509Z, pid: 52405] > git ls-files --others --exclude-standard - INITIALIZED
[version: 1.0.43, 2023-08-18T07:59:14.541Z, pid: 52406] > git ls-files --others --exclude-standard --directory --no-empty-directory - INITIALIZED
[version: 1.0.43, 2023-08-18T07:59:14.560Z, pid: 52405] > git ls-files --others --exclude-standard - COMPLETED (79ms)
[version: 1.0.43, 2023-08-18T07:59:14.592Z, pid: 52406] > git ls-files --others --exclude-standard --directory --no-empty-directory - COMPLETED (82ms)
[version: 1.0.43, 2023-08-18T08:01:36.953Z, pid: 52404] > dvc data status --granular --unchanged --json - COMPLETED (142502ms)
[version: 1.0.43, 2023-08-18T08:01:41.251Z, pid: 52496] > dvc data status --granular --unchanged --json - INITIALIZED
[version: 1.0.43, 2023-08-18T08:01:41.311Z, pid: 52497] > git ls-files --others --exclude-standard - INITIALIZED
[version: 1.0.43, 2023-08-18T08:01:41.374Z, pid: 52498] > git ls-files --others --exclude-standard --directory --no-empty-directory - INITIALIZED
[version: 1.0.43, 2023-08-18T08:01:41.379Z, pid: 52497] > git ls-files --others --exclude-standard - COMPLETED (127ms)
[version: 1.0.43, 2023-08-18T08:01:41.428Z, pid: 52498] > git ls-files --others --exclude-standard --directory --no-empty-directory - COMPLETED (117ms)
[version: 1.0.43, 2023-08-18T08:03:58.170Z, pid: 52496] > dvc data status --granular --unchanged --json - COMPLETED (136975ms)

But dvc exp show does not appear even though this is after several hours of these experiments running.

@mattseddon
Member

I suspect that this is related to iterative/dvc#9860 as we use the metrics/params file paths provided by exp show to work out what to watch in the workspace. If those paths are incorrect then we don't know what to watch for updates.

There could be some other error though so if you could please take a look at the console in VS Code's developer tools and share any errors that seem to be related to DVC that would be very helpful. You can access the console like this:

[Two screenshots showing how to open the developer tools console]

Thanks.

@mattseddon mattseddon added A: experiments Area: experiments table webview and everything related triage labels Aug 18, 2023
@aschuh-hf
Author

There are some errors relating to dvc exp show, though not sure if these are the most recent (i.e., related to the six experiments that I started after reporting iterative/dvc#9860).

The experiment direr-sous was one I had run before in parallel to the experiment moldy-sida, which finished fine but with incorrect metrics.json and params.yaml paths, as reported in the referenced issue. However, note that the paths still exist and the metrics for that finished experiment were shown in the table (and in the dvc exp show --json output). It's just that they were shown in new columns because of the change in paths (projects/multiseries/trials/pairwise_selected_target_series/data/train/metrics.json vs. data/train/metrics.json). The path projects/multiseries/trials/pairwise_selected_target_series is my DVC_ROOT.

[Two screenshots of the error messages: Screen Shot 2023-08-18 at 10 55 03 AM, Screen Shot 2023-08-18 at 10 54 38 AM]

My current HEAD commit SHA is c7c4d83d358e06b0bca753cbcddffb777fdf140c, i.e., the first --rev argument indeed seen in these error messages.

@aschuh-hf
Author

Is it also expected that the "DVC: Refresh Experiments" VS Code Extension command does not work?

[Screenshot: Screen Shot 2023-08-18 at 11 09 29 AM]

(I don't know what it does, just thought I'd try and see if it makes a difference)

@aschuh-hf
Author

Following the suggestion by @julieg18 to upgrade DVC to see if the other issue persists, I updated DVC to 3.15.2.

Not sure if that did anything or something else, but the Experiments table just updated. I see in the "DVC" output window:

[version: 1.0.43, 2023-08-18T10:19:04.300Z, pid: 58052] > dvc exp show --rev c7c4d83d358e06b0bca753cbcddffb777fdf140c --rev f7b9c12f505cb6498ab651df4be1447db610c76b --rev 9aca8f642eee1b3f816b4d4c587f525ad6540bc0 --rev c29975b5be3e984ede48359ff07c80d60f6013a0 --rev dc93a51d1b6cc33881e36acedf3ccfc60de0b3ad --rev 4717d23247ecd1ecc5422b81e5ef122f693809f2 --rev 0819967d5197d490b590c41e1227a45e65c4285e --json - COMPLETED (17228ms)

@aschuh-hf
Author

This issue persists still with DVC 3.15.

Could it be that dvc exp show --json is not being called more often because of the very high latency of a number of DVC CLI commands that the extension requires to update its views? The Experiments table hasn't been updated in more than one hour.

[version: 1.0.43, 2023-08-18T12:24:54.403Z, pid: 7583] > dvc exp show --rev 82ad2da35ad3ce4d5b7a7a468bfcacd6668daa2a --rev 0433360e53967c48dfc4c64d90768bbce5c6ac3e --rev e38e2882d416ecd4c2fbb1d26a8778c984d43472 --rev c7c4d83d358e06b0bca753cbcddffb777fdf140c --rev f7b9c12f505cb6498ab651df4be1447db610c76b --json - COMPLETED (21754ms)
[version: 1.0.43, 2023-08-18T12:35:52.149Z, pid: 33654] > dvc queue logs ochre-boff --follow - COMPLETED (43766543ms)
[version: 1.0.43, 2023-08-18T12:35:53.953Z, pid: 32111] > dvc queue logs licit-bish --follow - COMPLETED (51036871ms)
[version: 1.0.43, 2023-08-18T13:28:57.262Z, pid: 8454] > dvc queue logs licit-bish --follow - INITIALIZED
[version: 1.0.43, 2023-08-18T13:37:40.140Z, pid: 9271] > dvc data status --granular --unchanged --json - INITIALIZED
[version: 1.0.43, 2023-08-18T13:37:40.181Z, pid: 9272] > git ls-files --others --exclude-standard - INITIALIZED
[version: 1.0.43, 2023-08-18T13:37:40.221Z, pid: 9273] > git ls-files --others --exclude-standard --directory --no-empty-directory - INITIALIZED
[version: 1.0.43, 2023-08-18T13:37:40.233Z, pid: 9272] > git ls-files --others --exclude-standard - COMPLETED (92ms)
[version: 1.0.43, 2023-08-18T13:37:40.259Z, pid: 9273] > git ls-files --others --exclude-standard --directory --no-empty-directory - COMPLETED (77ms)
[version: 1.0.43, 2023-08-18T13:40:31.328Z, pid: 9271] > dvc data status --granular --unchanged --json - COMPLETED (171232ms)
[version: 1.0.43, 2023-08-18T13:40:35.480Z, pid: 9349] > dvc data status --granular --unchanged --json - INITIALIZED
[version: 1.0.43, 2023-08-18T13:40:35.544Z, pid: 9350] > git ls-files --others --exclude-standard - INITIALIZED
[version: 1.0.43, 2023-08-18T13:40:35.610Z, pid: 9351] > git ls-files --others --exclude-standard --directory --no-empty-directory - INITIALIZED
[version: 1.0.43, 2023-08-18T13:40:35.616Z, pid: 9350] > git ls-files --others --exclude-standard - COMPLETED (135ms)
[version: 1.0.43, 2023-08-18T13:40:35.667Z, pid: 9351] > git ls-files --others --exclude-standard --directory --no-empty-directory - COMPLETED (122ms)
[version: 1.0.43, 2023-08-18T13:43:02.702Z, pid: 9349] > dvc data status --granular --unchanged --json - COMPLETED (147277ms)
[version: 1.0.43, 2023-08-18T13:43:06.743Z, pid: 9731] > dvc data status --granular --unchanged --json - INITIALIZED
[version: 1.0.43, 2023-08-18T13:43:06.803Z, pid: 9732] > git ls-files --others --exclude-standard - INITIALIZED
[version: 1.0.43, 2023-08-18T13:43:06.856Z, pid: 9733] > git ls-files --others --exclude-standard --directory --no-empty-directory - INITIALIZED
[version: 1.0.43, 2023-08-18T13:43:06.861Z, pid: 9732] > git ls-files --others --exclude-standard - COMPLETED (117ms)
[version: 1.0.43, 2023-08-18T13:43:06.892Z, pid: 9733] > git ls-files --others --exclude-standard --directory --no-empty-directory - COMPLETED (88ms)
[version: 1.0.43, 2023-08-18T13:45:31.910Z, pid: 9731] > dvc data status --granular --unchanged --json - COMPLETED (145227ms)
[version: 1.0.43, 2023-08-18T13:45:36.685Z, pid: 9815] > dvc data status --granular --unchanged --json - INITIALIZED
[version: 1.0.43, 2023-08-18T13:45:36.740Z, pid: 9816] > git ls-files --others --exclude-standard - INITIALIZED
[version: 1.0.43, 2023-08-18T13:45:36.811Z, pid: 9817] > git ls-files --others --exclude-standard --directory --no-empty-directory - INITIALIZED
[version: 1.0.43, 2023-08-18T13:45:36.817Z, pid: 9816] > git ls-files --others --exclude-standard - COMPLETED (131ms)
[version: 1.0.43, 2023-08-18T13:45:36.862Z, pid: 9817] > git ls-files --others --exclude-standard --directory --no-empty-directory - COMPLETED (121ms)
[version: 1.0.43, 2023-08-18T13:48:01.662Z, pid: 9815] > dvc data status --granular --unchanged --json - COMPLETED (145031ms)
[version: 1.0.43, 2023-08-18T13:48:06.399Z, pid: 10297] > dvc data status --granular --unchanged --json - INITIALIZED
[version: 1.0.43, 2023-08-18T13:48:06.451Z, pid: 10298] > git ls-files --others --exclude-standard - INITIALIZED
[version: 1.0.43, 2023-08-18T13:48:06.514Z, pid: 10299] > git ls-files --others --exclude-standard --directory --no-empty-directory - INITIALIZED
[version: 1.0.43, 2023-08-18T13:48:06.522Z, pid: 10298] > git ls-files --others --exclude-standard - COMPLETED (122ms)
[version: 1.0.43, 2023-08-18T13:48:06.590Z, pid: 10299] > git ls-files --others --exclude-standard --directory --no-empty-directory - COMPLETED (139ms)
[version: 1.0.43, 2023-08-18T13:48:32.107Z, pid: 10321] > dvc queue logs dazed-tomb --follow - INITIALIZED
[version: 1.0.43, 2023-08-18T13:50:31.104Z, pid: 10297] > dvc data status --granular --unchanged --json - COMPLETED (144758ms)
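
One way to check the latency hypothesis is to summarize the "DVC" output window lines by command. The following is a minimal TypeScript sketch, assuming the log line layout shown in this thread; `parseCompleted` and `slowestByCommand` are hypothetical helper names and not part of the extension.

```typescript
// Hypothetical helpers for summarizing "DVC" output window lines.
// The line layout is assumed from the log excerpts quoted in this thread.
interface CompletedCommand {
  timestamp: string
  pid: number
  command: string
  durationMs: number
}

const COMPLETED_LINE =
  /^\[version: [^,]+, ([^,]+), pid: (\d+)\] > (.+) - COMPLETED \((\d+)ms\)$/

// Returns the parsed fields of a COMPLETED line, or undefined for any other line.
function parseCompleted(line: string): CompletedCommand | undefined {
  const match = COMPLETED_LINE.exec(line.trim())
  if (!match) {
    return undefined
  }
  const [, timestamp, pid, command, duration] = match
  return { timestamp, pid: Number(pid), command, durationMs: Number(duration) }
}

// Maps each command (flags stripped) to its longest observed duration in ms.
function slowestByCommand(lines: string[]): Map<string, number> {
  const slowest = new Map<string, number>()
  for (const line of lines) {
    const parsed = parseCompleted(line)
    if (!parsed) {
      continue
    }
    // Drop flags so that every "dvc exp show --rev ..." call shares one key.
    const key = parsed.command.split(" --")[0]
    slowest.set(key, Math.max(slowest.get(key) ?? 0, parsed.durationMs))
  }
  return slowest
}
```

In the lines quoted in this thread, each dvc data status call takes roughly 137 to 171 seconds, while the git ls-files calls complete in well under a second.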

@aschuh-hf
Author

I could use "DVC: Reset Persisted State and Reload Window" from the Command Palette to refresh the table, but this takes a long time as it has to regenerate a lot of cached data. Would it make sense to have a "Refresh" button at the top of the Experiments (and Plots) editor tabs, similar to the "Refresh Explorer" button in the file explorer?

@shcheklein shcheklein reopened this Aug 18, 2023
@mattseddon
Member

@aschuh-hf is this still happening?

@mattseddon mattseddon self-assigned this Aug 22, 2023
@shcheklein shcheklein added the priority-p1 Regular product backlog label Aug 22, 2023
@mattseddon mattseddon removed the triage label Aug 22, 2023
@aschuh-hf
Author

@mattseddon Yes, this is still happening. I currently have an experiment running and the table does not update. My workaround is to use the "+" and/or "-" buttons to include more or fewer commits because this triggers a refresh of the cached dvc exp show --json output.

What events would trigger the update? Could it have anything to do with my use of aws s3 sync to update the metrics JSON file during execution of the pipeline train stage shell script?

@mattseddon
Member

What events would trigger the update?

The extension uses a file watcher to call exp show for updates. VS Code provides events for any changes to files in the workspace. We use the following criteria to filter those events for experiment updates:

path => {
    const relPath = relative(this.dvcRoot, path)
    if (
      this.getWatchedFiles().some(
        watchedRelPath =>
          path.endsWith(watchedRelPath) ||
          isSameOrChild(relPath, watchedRelPath)
      ) &&
      !isPathInSubProject(relPath, this.relSubProjects)
    ) {
      void this.managedUpdate(path)
    }
}

Where watched files are all of the params and metrics files shown for the workspace in exp show + any dvc.lock, dvc.yaml or .dvc.

To check if the file system watchers are working you can save (without making any updates) either your params.yaml or dvc.yaml file for the project. If that doesn't trigger an update then the file watchers are not working as expected.

If you are not writing metrics to files in the workspace at all then that would be an issue.
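
The filter quoted above can be exercised in isolation. The following is a sketch only: `relativeTo`, `isSameOrChild` and `isPathInSubProject` are simplified stand-ins with assumed semantics (the extension's own helpers are not shown in this thread), `shouldUpdate` is a hypothetical name, and absolute POSIX-style paths are assumed.

```typescript
// Stand-in for path.relative, assuming absolute POSIX-style paths below `root`.
function relativeTo(root: string, path: string): string {
  const prefix = root.endsWith("/") ? root : `${root}/`
  return path.startsWith(prefix) ? path.slice(prefix.length) : path
}

// Assumed semantics: true when `path` equals `root` or lies beneath it.
function isSameOrChild(root: string, path: string): boolean {
  return path === root || path.startsWith(`${root}/`)
}

// Assumed semantics: true when `relPath` is inside one of the nested DVC projects.
function isPathInSubProject(relPath: string, relSubProjects: string[]): boolean {
  return relSubProjects.some(sub => isSameOrChild(sub, relPath))
}

// Mirrors the quoted condition: should a change event for `path` trigger an update?
function shouldUpdate(
  dvcRoot: string,
  watchedFiles: string[],
  relSubProjects: string[],
  path: string
): boolean {
  const relPath = relativeTo(dvcRoot, path)
  return (
    watchedFiles.some(
      watchedRelPath =>
        path.endsWith(watchedRelPath) || isSameOrChild(relPath, watchedRelPath)
    ) && !isPathInSubProject(relPath, relSubProjects)
  )
}
```

Note that a filter like this only runs once VS Code has delivered a change event for the path; it cannot help when no event is emitted in the first place.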

@aschuh-hf
Author

aschuh-hf commented Aug 24, 2023

Thanks, that is useful. My workspace is a .dvc/tmp/exps/tmp* folder because the experiment is running as a queue task.

dvc exp run --queue
dvc queue start -j 1

(though I'm actually using the VS Code Extension to trigger these commands from the "Experiments" tab)

When I navigate to the respective experiment folder in the temp folder of the running experiment and save the dvc.yaml or params.yaml file without making changes, I can see in the DVC output window that it runs dvc exp show --json.

However, when I do the same with the metrics.json file or any of the plots TSV files, nothing happens. But only these files get modified during the execution of the experiment, and the dvc.lock file is written by DVC only once the experiment has finished.

@mattseddon
Member

What keys are shown underneath the params/metrics key in your exp show --json output?

e.g.

...
      "params": {
        "params.yaml": {
...
      "metrics": {
        "training/metrics.json": {
...

is the metrics.json file path shown in the data?

Are you using a mono-repo and do you have the mono-repo open at the root?

@aschuh-hf
Author

aschuh-hf commented Aug 24, 2023

Are you using a mono-repo and do you have the mono-repo open at the root?

I'm in a mono-repo and have the DVC project (path where dvc init --subdir was run) open as VS Code Workspace
(path projects/multiseries/trials/pairwise_selected_target_series/ from iterative/dvc#9860).

GIT_REPO_ROOT/vscode/project.vscode-workspace

{
	"folders": [
		{
			"path": "../projects/multiseries/trials/pairwise_selected_target_series"
		}
	],
        // ...
}

What keys are shown underneath the params/metrics key in your exp show --json output?

  • "params": {"params.yaml": ...}
  • "metrics": {"data/train/metrics.json": ...}

These are the paths that I expect.
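
For reference, collecting the file paths from those keys could look like the sketch below. It assumes only the fragment shown in this thread (an object with `params` and `metrics` keys whose children are keyed by file path); `RevisionData` and `collectWatchedPaths` are hypothetical names, and the rest of the exp show --json structure is deliberately left out.

```typescript
// Hypothetical shape covering only the fragment quoted in this thread.
type FileEntries = Record<string, unknown>

interface RevisionData {
  params?: FileEntries
  metrics?: FileEntries
}

// Returns the unique params and metrics file paths, sorted for stable output.
function collectWatchedPaths(data: RevisionData): string[] {
  const keys = Object.keys(data.params ?? {}).concat(
    Object.keys(data.metrics ?? {})
  )
  return Array.from(new Set(keys)).sort()
}
```

With the keys listed above, this returns data/train/metrics.json and params.yaml.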

@aschuh-hf
Author

Re #4564: The dvc exp show --json keys for params and metrics are the same for all experiments.

They match the keys under workspace.

@mattseddon
Member

What else are you using the GIT_REPO_ROOT/vscode/project.vscode-workspace file for? Do you need that file if you aren't using a multi-folder workspace?

@aschuh-hf
Author

aschuh-hf commented Aug 24, 2023

It's not that I need it as I could use .vscode/*.json files inside the DVC project folder, but I prefer having all VS Code settings (settings, debug commands, recommended extensions) in a single configuration file, and also all different workspaces for different projects I work on in one location. For different projects / sub-projects, I have a separate .vscode-workspace file at the mono-repo root (vscode/ folder, respectively). Some include a single folder, some include multiple if I commonly also edit shared Python libs or other used files from other projects.

I would expect DVC to work either way? Whether I open a folder or use a single- or multi-folder workspace.

@mattseddon
Member

mattseddon commented Aug 24, 2023

By any chance are you opening/closing/reloading VS Code between queueing these experiments up and then running them?

Edit: Not relevant.

@mattseddon
Member

However, when I do the same with the metrics.json file or any of the plots TSV files, nothing happens. But only these files get modified during the execution of the experiment, and the dvc.lock file is written by DVC only once the experiment has finished.

Would you be able to check if events are getting created (at all) for the files in question using the instructions here: https://github.com/microsoft/vscode/wiki/File-Watcher-Issues#logging-local

@aschuh-hf
Author

aschuh-hf commented Aug 24, 2023

The previous experiment was unfortunately already done and the temp folder deleted.

I enabled the Trace log level and started a new experiment. This time, even when I save dvc.yaml in the temp folder of the running experiment task, it does not trigger dvc exp show --json. There are no file watcher events in the Console log either. At this point, no metrics.json file has been written yet; the previous one was removed by dvc exp run. The table still shows the initial metrics from the base rev commit. After the first aws s3 sync, which writes the metrics.json file, there is still no file watcher event. So no matter which file I touch, no events based on file watchers seem to be triggered.

When I touch the dvc.yaml in the workspace instead of the temp folder of the running queued experiment, then they get triggered and the table updates.

@aschuh-hf
Author

aschuh-hf commented Aug 24, 2023

I was watching the local logs. I am actually using SSH Remote, but I cannot find the "Log (Remote Server)" output in the dropdown list as mentioned at https://github.com/microsoft/vscode/wiki/File-Watcher-Issues#logging-remote.

Never mind, I found the file watcher log outputs on the remote in the "Server" (not "Log (Remote Server)") output.

2023-08-24 13:45:08.216 [trace] [File Watcher (parcel)] [CHANGED] /data/aschuh/hf-research/projects/multiseries/trials/pairwise_selected_target_series/.dvc/tmp/exps/tmp13b5j4fa/projects/multiseries/trials/pairwise_selected_target_series/dvc.yaml
2023-08-24 13:45:08.291 [trace] [File Watcher (parcel)]  >> normalized [CHANGED] /data/aschuh/hf-research/projects/multiseries/trials/pairwise_selected_target_series/.dvc/tmp/exps/tmp13b5j4fa/projects/multiseries/trials/pairwise_selected_target_series/dvc.yaml

This is when I now touch the dvc.yaml in the temp folder of the running experiment. For some reason, it also doesn't trigger dvc exp show --json in the DVC output window, even though it did when I tried it earlier.

Note that I had reloaded the window since my previous comment (i.e., since starting the experiment) while I was looking for the trace output of the file watcher.

When I touch the metrics.json in the temp of the running experiment, no file watcher trace log entry is being recorded.

@mattseddon
Member

Looks like this is the issue: microsoft/vscode#176327

@mattseddon
Member

@aschuh-hf as you can see from the above PRs I've been working on getting this fixed.

The next version of DVC will contain a DVC_ROOT env var for queued experiments. That will be used by DVCLive (in its next version) to write to a signal file in the root's .dvc/tmp/exps/run directory on next_step(). The extension will be watching for updates to the new signal file to trigger both experiments and plots data updates.
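
As an illustration of the mechanism described above, the signal file path can be derived from DVC_ROOT as in the sketch below. The directory comes from the comment above and the file name DVCLIVE_STEP_COMPLETED is the one reported later in this thread; `signalFilePath` and `isSignalFileEvent` are hypothetical helpers, and how the extension actually matches the event is an assumption. POSIX-style paths are assumed.

```typescript
// Directory taken from the comment above; file name as reported later in this thread.
const SIGNAL_FILE_PARTS = [".dvc", "tmp", "exps", "run", "DVCLIVE_STEP_COMPLETED"]

// Returns the signal file path, or undefined when DVC_ROOT is not set.
function signalFilePath(
  env: Record<string, string | undefined>
): string | undefined {
  const root = env.DVC_ROOT
  if (!root) {
    return undefined
  }
  // Strip trailing slashes so the joined path has no doubled separators.
  return [root.replace(/\/+$/, "")].concat(SIGNAL_FILE_PARTS).join("/")
}

// Hypothetical check a watcher callback could use to spot a signal file update.
function isSignalFileEvent(dvcRoot: string, changedPath: string): boolean {
  return changedPath === signalFilePath({ DVC_ROOT: dvcRoot })
}
```

Because the signal file lives under the project root rather than in the experiment's temp folder, a change to it does not depend on events from .dvc/tmp/exps/tmp* directories.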

@mattseddon
Member

This should be fixed with the latest versions of DVC/DVCLive/the extension.

@aschuh-hf
Author

Fantastic! Thanks for working out a solution across the different subprojects.

(also glad to have the DVC_ROOT environment variable back)

@mattseddon
Member

For anyone else that comes across this issue the min required versions for the fix are:

  • DVC: 3.17.0
  • DVCLive: 2.16.0
  • Ext: 1.0.49
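
A quick way to compare installed versions against these minimums is sketched below. `compareVersions` and `outdatedTools` are hypothetical helpers; the comparison assumes plain dotted numeric versions and does not handle pre-release suffixes.

```typescript
// Compares dotted numeric versions such as "3.17.0".
// Returns a negative number if a < b, zero if equal, positive if a > b.
function compareVersions(a: string, b: string): number {
  const left = a.split(".").map(Number)
  const right = b.split(".").map(Number)
  const length = Math.max(left.length, right.length)
  for (let i = 0; i < length; i++) {
    const diff = (left[i] ?? 0) - (right[i] ?? 0)
    if (diff !== 0) {
      return diff
    }
  }
  return 0
}

// Minimum versions as listed in the comment above.
const MIN_VERSIONS = { dvc: "3.17.0", dvclive: "2.16.0", extension: "1.0.49" }

type Tool = keyof typeof MIN_VERSIONS

// Lists the tools whose installed version is below the required minimum.
function outdatedTools(installed: Record<Tool, string>): Tool[] {
  return (Object.keys(MIN_VERSIONS) as Tool[]).filter(
    tool => compareVersions(installed[tool], MIN_VERSIONS[tool]) < 0
  )
}
```

For example, with DVC 3.15.2 and extension 1.0.43 (the versions mentioned earlier in this thread), both `dvc` and `extension` would be reported as outdated.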

@aschuh-hf
Author

aschuh-hf commented Sep 1, 2023

I am just realizing that this doesn't fix the issue for me, because DVCLive is not running on my local machine which is executing the queued experiment task (dvc exp queue + dvc queue start via the Extension).

Instead, my train stage command submits a train job to a remote cluster. The remote job runs DVCLive to produce the artifacts and uploads those to cloud storage. My DVC experiment command itself periodically syncs this cloud data with the local experiment temp folder content. It thus overwrites the metrics.json file and plots folder at regular intervals. But Live.next_step() will never be called on the local machine which is running the DVC experiment.

I can use the DVC_ROOT to touch a file in my DVC experiment command after the data has been synced as you suggested previously. I wonder if I should touch the new signal file instead of a file such as dvc.yaml in my workspace?

if aws s3 sync --only-show-errors "${s3_dvc_dir}" "${dvc_dir}/"; then
    [ $verbose -lt 2 ] || info "Synced DVCLive metrics and plots"
    [ -z "$DVC_ROOT" ] || touch "${DVC_ROOT}/.dvc/tmp/exps/run/DVCLIVE_STEP_COMPLETED"
fi

I can confirm that when I touch .dvc/tmp/exps/run/DVCLIVE_STEP_COMPLETED, the table updates.

My current preference would be to use the new signal file because it is outside my workspace, which I may work on while the experiments are running. It would also not be too harmful if the signal file were changed in later DVC versions: my script would stop working until I changed it, but only the update of the Experiments table would break (and I can link to this issue from the code for documentation on how to then resolve it).

What would you suggest?

@mattseddon
Member

What would you suggest?

Seems like you have all of the information required to make a decision. I don't have anything to add.

3 participants