Better support plots functionality #1274

mattseddon · 2022-01-31T22:16:47Z

Original use case

Give the user the ability to view any plot(s) from the current workspace, i.e. any plot generated between HEAD and the workspace, including all experiments and checkpoints.

Plots current state

1. exp show data is used to gather all of the revisions in the current workspace, i.e. the most recent commit + all experiments & checkpoints.
2. Any missing revisions are requested from plots diff <REVISIONS> --show-json -o .dvc/tmp/plots.
3. We store paths to images in memory and also split out the revision data from their template and cache that in memory as well.
4. We "re-assemble" (vega) plots from the template + revision data before displaying them to the user (in much the same way Studio does).
5. Image paths/revision data are dumped from memory whenever a revision is no longer present in the exp show data (e.g. after a commit is made).
6. When the VS Code session ends the temporary .dvc/tmp/plots folder is removed from disk.
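
The split and re-assembly described above can be sketched roughly as follows. This is a minimal Python illustration, not the extension's actual code. It assumes the `content.data.values` layout and the per-datapoint `rev` field that `plots diff --show-json` produces for vega plots; the function names `split_template` and `reassemble` are hypothetical and the sample datapoints are made up.

```python
import copy
from collections import defaultdict


def split_template(plot: dict) -> tuple[dict, dict[str, list[dict]]]:
    """Separate a baked vega plot into an empty template and per-revision data."""
    template = copy.deepcopy(plot["content"])
    # Remove the baked-in datapoints so only the template remains.
    values = template["data"].pop("values", [])
    by_rev: dict[str, list[dict]] = defaultdict(list)
    for point in values:
        by_rev[point["rev"]].append(point)
    return template, dict(by_rev)


def reassemble(template: dict, by_rev: dict[str, list[dict]], revs: list[str]) -> dict:
    """Fill a copy of the template with the datapoints of the selected revisions."""
    spec = copy.deepcopy(template)
    spec["data"]["values"] = [p for rev in revs for p in by_rev.get(rev, [])]
    return spec


# Made-up sample in the assumed shape (two revisions, one datapoint each).
plot = {
    "type": "vega",
    "revisions": ["workspace", "refs/exps/exec/EXEC_BASELINE"],
    "content": {
        "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
        "data": {"values": [
            {"epoch": "1", "rev": "workspace", "step": 0},
            {"epoch": "1", "rev": "refs/exps/exec/EXEC_BASELINE", "step": 0},
        ]},
    },
}
template, by_rev = split_template(plot)
spec = reassemble(template, by_rev, ["workspace"])
```

Caching `template` and `by_rev` separately is what lets a plot be rebuilt for any selection of revisions without another call to the CLI, but it only works while the data sits at a known location inside the template.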

Limitations (in order of priority):

1. plots diff returns revision data baked into the template, which we then have to split manually as per step 3 of the current state. This is a hack at best and we will not be able to rely on it when the extension starts accepting more than a handful of predefined templates.
2. Revisions for running experiments that have been queued are not returned as expected.**
3. plots diff: duplicate revisions not returned dvc#7265

** When an experiment has been queued and is then running under "executor": "temp", the appropriate "live" data is available under .dvc/tmp/exps/<TEMP_DIR_NAME>/path/to/file as opposed to path/to/file. Until the experiment has completed, plots diff will return the data for the parent revision. These two videos demonstrate what is shown in the extension when a repo (without checkpoints) that has "live" plots has a queued experiment running:

cc9db9e (running) matches b137fa8:

Screen.Recording.2022-01-31.at.5.22.41.pm.mov

cc9db9e completes and the final data is copied into the workspace:

Screen.Recording.2022-01-31.at.5.23.03.pm.mov

Proposed solution:

I quickly talked to @pawel about this and these are the provisional ideas that we came up with:

1. Add an extra flag to plots show that provides only the half-baked templates, along with a path at which to insert the data into the template (this saves us scanning for an anchor).
2. Have exp show return the plots data for each revision that it sends. This would greatly simplify the code on our end, and it also seems like the logical way to deal with an experiment running outside of the current workspace.
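
To illustrate the first idea, here is a minimal Python sketch of what consuming a half-baked template might look like. The flag and its output format do not exist yet, so the payload shape used here (a `template` plus a `data_path` list of keys) is purely hypothetical, as is the `fill_template` helper.

```python
import copy


def fill_template(template: dict, data_path: list[str], values: list[dict]) -> dict:
    """Return a copy of `template` with `values` inserted at `data_path`."""
    spec = copy.deepcopy(template)
    target = spec
    # Walk down to the parent of the final key, creating levels if needed.
    for key in data_path[:-1]:
        target = target.setdefault(key, {})
    target[data_path[-1]] = values
    return spec


# Hypothetical payload: a template without data, plus where the data belongs.
payload = {
    "template": {
        "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
        "data": {},
    },
    "data_path": ["data", "values"],
}
filled = fill_template(
    payload["template"],
    payload["data_path"],
    [{"step": 0, "rev": "workspace"}],
)
```

With an explicit path the consumer no longer needs to know anything about the structure of a particular template, which is the property that matters once user-defined templates are accepted.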

cc @efiop @dberenbaum


mattseddon · 2022-01-31T22:19:26Z

Relates to #1256

mattseddon · 2022-02-01T00:01:27Z

Some further info on trying to "live update" plots when running a non-checkpoint experiment in the workspace.

Running a 10 epoch experiment against example-dvc-experiments (non-checkpoints) in the workspace and then calling plots diff --show-json -o .dvc/tmp/plots yields this on the first epoch:

~/example-dvc-experiments @f5f308f5 !4 ❯ dvc plots diff -o .dvc/tmp/plots --show-json
DVC failed to load some plots for following revisions: 'workspace'.
{
  "logs.csv": [
    {
      "type": "vega",
      "revisions": [
        "workspace",
        "refs/exps/exec/EXEC_BASELINE"
      ],
      "content": {
        "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
        "data": {
          "values": [
            {
              " acc": " 0.8298666477203369",
              " loss": " 0.48006996512413025",
              " val_acc": " 0.8737999796867371",
              " val_loss": " 0.3617594540119171",
              "epoch": "1",
              "rev": "refs/exps/exec/EXEC_BASELINE",
              "step": 0
            },
            {
              " acc": " 0.8772833347320557",
              " loss": " 0.3410544991493225",
              " val_acc": " 0.8863999843597412",
              " val_loss": " 0.31630298495292664",
              "epoch": "2",
              "rev": "refs/exps/exec/EXEC_BASELINE",
              "step": 1
            },
            {
              " acc": " 0.8890666961669922",
              " loss": " 0.3060307502746582",
              " val_acc": " 0.8928999900817871",
              " val_loss": " 0.2947954833507538",
              "epoch": "3",
              "rev": "refs/exps/exec/EXEC_BASELINE",
              "step": 2
            },
            {
              " acc": " 0.8975833058357239",
              " loss": " 0.28065934777259827",
              " val_acc": " 0.8963000178337097",
              " val_loss": " 0.2771669328212738",
              "epoch": "4",
              "rev": "refs/exps/exec/EXEC_BASELINE",
              "step": 3
            },
            {
              " acc": " 0.9055333137512207",
              " loss": " 0.2595141530036926",
              " val_acc": " 0.9053000211715698",
              " val_loss": " 0.2615179717540741",
              "epoch": "5",
              "rev": "refs/exps/exec/EXEC_BASELINE",
              "step": 4
            },
            {
              " acc": " 0.9101333618164062",
              " loss": " 0.2426270693540573",
              " val_acc": " 0.9064000248908997",
              " val_loss": " 0.2575400173664093",
              "epoch": "6",
              "rev": "refs/exps/exec/EXEC_BASELINE",
              "step": 5
            },
            {
              " acc": " 0.9144333600997925",
              " loss": " 0.22980111837387085",
              " val_acc": " 0.9067999720573425",
              " val_loss": " 0.2509685158729553",
              "epoch": "7",
              "rev": "refs/exps/exec/EXEC_BASELINE",
              "step": 6
            },
            {
              " acc": " 0.9188666939735413",
              " loss": " 0.21584856510162354",
              " val_acc": " 0.9067000150680542",
              " val_loss": " 0.24992917478084564",
              "epoch": "8",
              "rev": "refs/exps/exec/EXEC_BASELINE",
              "step": 7
            },
            {
              " acc": " 0.9227333068847656",
              " loss": " 0.2055625468492508",
              " val_acc": " 0.9146000146865845",
              " val_loss": " 0.24114982783794403",
              "epoch": "9",
              "rev": "refs/exps/exec/EXEC_BASELINE",
              "step": 8
            },
            {
              " acc": " 0.9268500208854675",
              " loss": " 0.19567501544952393",
              " val_acc": " 0.9157000184059143",
              " val_loss": " 0.2405654340982437",
              "epoch": "10",
              "rev": "refs/exps/exec/EXEC_BASELINE",
              "step": 9
            }
          ]
        },
     ...template
    }
  ]
}

The output remains the same until the experiment is complete. I am not even sure where this experiment is being run because nothing is coming through any of the watchers. This would be another reason to have the plots data come through the exp show API.

edit: obviously this does not work because the project does not have a logger (e.g. dvclive) set up.
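
One way to spot this situation programmatically is to compare the `revisions` a plot lists against the `rev` values its datapoints actually carry. The Python sketch below assumes the JSON layout shown in the output above; `revisions_without_data` is a hypothetical helper, and the sample is a trimmed-down version of that output.

```python
def revisions_without_data(plots: dict) -> dict[str, list[str]]:
    """Map each plot path to the listed revisions that have no datapoints."""
    missing: dict[str, list[str]] = {}
    for path, entries in plots.items():
        for entry in entries:
            values = entry.get("content", {}).get("data", {}).get("values", [])
            seen = {point.get("rev") for point in values}
            absent = [rev for rev in entry.get("revisions", []) if rev not in seen]
            if absent:
                missing.setdefault(path, []).extend(absent)
    return missing


# Trimmed-down version of the output above: "workspace" is listed as a
# revision, but every datapoint belongs to the baseline ref.
output = {
    "logs.csv": [{
        "type": "vega",
        "revisions": ["workspace", "refs/exps/exec/EXEC_BASELINE"],
        "content": {"data": {"values": [
            {"epoch": "1", "rev": "refs/exps/exec/EXEC_BASELINE", "step": 0},
            {"epoch": "2", "rev": "refs/exps/exec/EXEC_BASELINE", "step": 1},
        ]}},
    }]
}
print(revisions_without_data(output))  # {'logs.csv': ['workspace']}
```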

dberenbaum · 2022-02-01T18:57:33Z

3. Revisions for running experiments that have been queued are not returned as expected**

If I understand, this issue is limited to a scenario where:

The experiment has been queued or is otherwise run in a temporary directory.
The experiment is still running.
The experiment has plots that need to be updated live (training over multiple epochs and appending to the plots file in each epoch, which dvclive.next_step() does).
The experiment is not using checkpoints (see Live plots without checkpoints #1256).

Is that correct?

mattseddon · 2022-02-01T21:18:16Z

If I understand, this issue is limited to a scenario where:

The experiment has been queued or is otherwise run in a temporary directory.

The experiment is still running.

The experiment has plots that need to be updated live (training over multiple epochs and appending to the plots file in each epoch, which dvclive.next_step() does).

The experiment is not using checkpoints (see Live plots without checkpoints #1256).

Is that correct?

Yep, I think that is the only scenario where we could be getting updates but currently can't. The table shows all of the permutations that I can think of at the moment. ✅ = can get updates, ❌ = not currently possible.

Run method	Checkpoints	No checkpoints w/logger	No checkpoints no logger
workspace	✅	✅ **	❌
queue	✅	❌	❌

Can you think of any scenarios that I've missed? I realised that the whole queuing system is experimental at the moment and will be getting worked on soon. Let's talk about this as I think it would be very beneficial to nail down which are the most important scenarios.

** Tested by adding DvcLiveCallback to the example-dvc-experiments train script.

Screen.Recording.2022-02-02.at.9.48.33.am.mov

dberenbaum · 2022-02-01T22:22:29Z

Notes from our meeting:

1. plots diff returns revision data baked into the template we then have to manually split as per 3. This is a hack at best and we will not be able to rely on it when the extension starts accepting more than a handful of predefined templates.

This is the highest priority at the moment, so let's focus on this for now.

2. Revisions for running experiments that have been queued are not returned as expected

This is the scenario in the table above for "queue + no checkpoints w/ logger." Since this isn't as high priority, we don't need to make any decision on it yet, but it's unclear to me if this is a blocker for initial release since it seems like a more advanced scenario.

@daavoo might have thoughts on both the importance and possible implementation for this.

dberenbaum · 2022-02-02T01:00:27Z

Some initial thoughts on live plots for queued/temp experiments without checkpoints. I don't think exp show will be any more helpful than plots since both are collecting information from git revisions. Maybe in the future when dvc needs to get status updates for experiments on remote executors, dvc will collect this level of detail about running experiments, but I don't think it's possible today. The only way I know to keep track of live updates to running experiments is to treat each .dvc/tmp/exps/<TEMP_DIR_NAME> like the workspace, watching for updates and running plots from inside each temp dir.
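
As a rough illustration of that approach, the Python sketch below lists the directories under `.dvc/tmp/exps` and pairs each one with a plots command to run from inside it. It only builds the invocations and does not execute them. The `dvc plots diff -o .dvc/tmp/plots --show-json` call is the one used earlier in this thread; the helper name `plots_commands` and its `exclude` parameter are hypothetical, and treating every remaining subdirectory as an experiment workspace is an assumption.

```python
from pathlib import Path

# Command taken from earlier in this thread.
PLOTS_CMD = ["dvc", "plots", "diff", "-o", ".dvc/tmp/plots", "--show-json"]


def plots_commands(
    repo_root: Path, exclude: frozenset = frozenset()
) -> list[tuple[Path, list[str]]]:
    """Pair every temp experiment dir with the plots command to run inside it."""
    exps_dir = repo_root / ".dvc" / "tmp" / "exps"
    if not exps_dir.is_dir():
        return []
    return [
        (child, list(PLOTS_CMD))
        for child in sorted(exps_dir.iterdir())
        # `exclude` lets the caller skip directories that are not workspaces.
        if child.is_dir() and child.name not in exclude
    ]
```

Each pair could then be handed to something like `subprocess.run(cmd, cwd=temp_dir, capture_output=True)`, with a file watcher on each temp dir deciding when to re-run it.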

mattseddon · 2022-02-02T01:09:05Z

Maybe in the future when dvc needs to get status updates for experiments on remote executors, dvc will collect this level of detail about running experiments, but I don't think it's possible today. The only way I know to keep track of live updates to running experiments is to treat each .dvc/tmp/exps/<TEMP_DIR_NAME> like the workspace, watching for updates and running plots from inside each temp dir.

We do already get the updates from the temp directories coming through. We even call for a specific revision relating to the running experiment (e.g. dvc plots diff a43650e -o .dvc/tmp/plots --show-json). Where the process falls down is that plots diff has no idea what that revision is or where to find it. I can even see the data being generated under something like:

file:///example-dvc-experiments/.dvc/tmp/exps/tmppsgp3kkh/logs_dvc_plots/index.html

The bit of plumbing that is missing is the mapping of the temp directory to the revision. Once the running experiment finishes all of the data shows up in the workspace and the plots are updated "in bulk".

LMK if that doesn't make sense.

dberenbaum · 2022-02-02T16:11:59Z

The bit of plumbing that is missing is the mapping of the temp directory to the revision.

The temp dir info should be in .dvc/tmp/exps/run/<rev_num>/<rev_num>.run, so it should be possible to get the temp dir info there, cd into the temp dir, and run dvc plots.

@pmrowla How stable is this for finding the temp dir where an experiment is running? The VS Code team wants to run dvc plots on in-progress non-checkpoint experiments to get updates to dvclive plots. Any feedback or ideas would be appreciated!

daavoo · 2022-02-02T18:44:55Z

@daavoo might have thoughts on both the importance and possible implementation for this.

For me, workspace + no checkpoints w/ logger is the most relevant scenario as it covers the more consolidated/frequent workflow of DVC, running pipelines with dvc repro or plain dvc exp run.

I could not tell about queue + no checkpoints w/ logger because in my work I have always considered:

live tracking to be meaningful for computationally expensive (meaning, takes time to run) experiments.
parallel execution/scheduling (queueing) of these computationally expensive experiments to be a non-local thing (i.e. tasks are distributed to GPU server or cloud)

So, live tracking of locally queued experiments is a scenario I haven't really explored in practice.

mattseddon · 2022-02-02T21:41:29Z

The temp dir info should be in .dvc/tmp/exps/run/<rev_num>/<rev_num>.run, so it should be possible to get the temp dir info there, cd into the temp dir, and run dvc plots.

@dberenbaum As a temporary patch we can read that file, parse the JSON and use the information to cd into the temp dir. However, seeing as that is relevant information, I would expect it to come through in the exp show output for the experiment, so that we don't have to rely on the underlying implementation.
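
A minimal Python sketch of that temporary patch follows. The file location `.dvc/tmp/exps/run/<rev_num>/<rev_num>.run` comes from the comment quoted above. The name of the JSON key that holds the temp dir is not stated in this thread, so the `key` argument is a placeholder the caller has to supply, and the helper name `temp_dir_for` is hypothetical.

```python
import json
from pathlib import Path
from typing import Optional


def temp_dir_for(repo_root: Path, rev: str, key: str) -> Optional[Path]:
    """Return the temp dir recorded for `rev`, or None if it cannot be found."""
    run_file = repo_root / ".dvc" / "tmp" / "exps" / "run" / rev / f"{rev}.run"
    try:
        info = json.loads(run_file.read_text())
    except (OSError, json.JSONDecodeError):
        # Missing or unreadable file: the experiment is not (or no longer) running.
        return None
    # `key` is an assumed field name; the real layout may differ.
    value = info.get(key) if isinstance(info, dict) else None
    return Path(value) if isinstance(value, str) else None
```

Returning `None` instead of raising keeps the caller's fallback simple: when no temp dir can be resolved, the extension can keep showing the parent revision's data as it does today.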

pmrowla · 2022-02-03T05:52:54Z

@pmrowla How stable is this for finding the temp dir where an experiment is running? The VS Code team wants to run dvc plots on in-progress non-checkpoint experiments to get updates to dvclive plots. Any feedback or ideas would be appreciated!

This should not be considered stable right now, and the directory/file structure will probably continue to change in the near future, especially while the queueing work is ongoing.

But eventually, the idea is that yes, we will have some kind of serialized information where consumers can look up status info for what is running and where it's being run. So in theory at that point the vscode extension could get the live plots data from the temp dir instead of needing it to all be fetched/collected by DVC into the main repo.

dberenbaum · 2022-02-03T13:30:10Z

I would expect it to come through in the exp show output for the experiment

Do you expect to have a way to find the location where the experiment is running, or do you expect exp show to include the plots data for each experiment? AFAIK the experiment location is in scope for what @pmrowla is doing but plots data for each experiment is not.

mattseddon · 2022-02-03T22:12:05Z

Do you expect to have a way to find the location where the experiment is running, or do you expect exp show to include the plots data for each experiment? AFAIK the experiment location is in scope for what @pmrowla is doing but plots data for each experiment is not.

I would be happy with the location as a short term solution.

I am unsure as to what the long-term solution should be. I agree that having plots data in the exp show output would dramatically bloat it, and it is never going to provide any benefit to the CLI table (because where would it go?).

parallel execution/scheduling (queueing) of these computationally expensive experiments to be a non-local thing (i.e. tasks are distributed to GPU server or cloud)

Can I ask what would be expected for plots in terms of remote execution? My expectation would be that I could see live updates for multiple experiments running in the cloud. With that in mind maybe we would want to add a --plots flag to the exp show command because we don't want to be making multiple calls to a remote machine.

dberenbaum · 2022-02-04T22:16:54Z

Can I ask what would be expected for plots in terms of remote execution? My expectation would be that I could see live updates for multiple experiments running in the cloud. With that in mind maybe we would want to add a --plots flag to the exp show command because we don't want to be making multiple calls to a remote machine.

👍 That's a good question, and your proposal makes sense. I haven't put much thought into this yet. There's no expectation in any DVC proposals so far that users could see live plots updates for non-checkpoint experiments running in the cloud, and I would say we have much more basic problems to solve first for remote execution 😁 . In dvclive, there are discussions about how to provide regular notifications/updates: iterative/dvclive#90, which may be enough for users who want to keep tabs on remote experiments.

So, live tracking of locally queued experiments is a scenario I haven't really explored in practice.

👍 Queuing local experiments is more of a prerequisite for remote execution than a fully realized feature right now, but a typical workflow for me would have been to log in to a large cloud instance/cluster and run multiple experiments there in parallel. I think plenty of users are running dvc inside cloud instances rather than on their laptops, so "local" execution may not be limited to laptop scenarios. However, I'm not sure how well the DVC VS Code extension would work in that remote-ssh scenario 🤔 .

mattseddon · 2022-02-06T23:11:17Z

👍 Queuing local experiments is more of a prerequisite for remote execution than a fully realized feature right now, but a typical workflow for me would have been to log in to a large cloud instance/cluster and run multiple experiments there in parallel. I think plenty of users are running dvc inside cloud instances rather than on their laptops, so "local" execution may not be limited to laptop scenarios. However, I'm not sure how well the DVC VS Code extension would work in that remote-ssh scenario 🤔 .

This is something that VS Code does well: https://code.visualstudio.com/docs/remote/ssh. We should be able to piggyback on that behaviour 👍🏻 .

mattseddon · 2022-05-30T06:01:13Z

The bulk of the actionable items here have been covered.

In order to better support plots, we will now need to complete #1689, #1643, #1757 & #1117. Closing this now.

mattseddon added the enhancement, discussion, A: plots and A: integration labels Jan 31, 2022

mattseddon mentioned this issue Jan 31, 2022

Live plots without checkpoints #1256

Closed

mattseddon mentioned this issue May 10, 2022

plots diff: performance issues #1689

Closed

mattseddon closed this as completed May 30, 2022
