Better support plots functionality #1274

Closed
mattseddon opened this issue Jan 31, 2022 · 16 comments
Labels
A: integration Area: DVC integration layer A: plots Area: plots webview, side panel and everything related discussion enhancement New feature or request

Comments

@mattseddon
Member

mattseddon commented Jan 31, 2022

Original use case

Give the user the ability to view any plot(s) from the current workspace, i.e., any plot generated between HEAD and the workspace, including all experiments and checkpoints.

Plots current state

  1. exp show data is used to gather all of the revisions in the current workspace, i.e., the most recent commit plus all experiments and checkpoints.
  2. Any missing revisions are requested from plots diff <REVISIONS> --show-json -o .dvc/tmp/plots.
  3. We store paths to images in memory and also split out the revision data from their template and cache that in memory as well.
  4. We "re-assemble" (vega) plots from the template + revision data before displaying it to the user (in much the same way Studio does).
  5. Image paths and revision data are dropped from memory whenever a revision is no longer present in the exp show data (e.g., after a commit is made).
  6. When the VS Code session ends the temporary .dvc/tmp/plots folder is removed from disk.
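
Steps 3 to 5 can be sketched roughly as follows. This is a minimal Python illustration and not the extension's actual code. It assumes the `content.data.values` layout with a `rev` field on every datapoint, and the helper names `split_revisions`, `assemble` and `evict` are hypothetical.

```python
import copy
from collections import defaultdict


def split_revisions(content):
    """Split a baked vega spec into (template, {rev: datapoints})."""
    template = copy.deepcopy(content)
    values = template["data"].pop("values", [])
    by_rev = defaultdict(list)
    for row in values:
        by_rev[row["rev"]].append(row)
    return template, dict(by_rev)


def assemble(template, cache, revisions):
    """Re-assemble a spec from the template plus the cached revisions."""
    spec = copy.deepcopy(template)
    spec["data"]["values"] = [
        row for rev in revisions for row in cache.get(rev, [])
    ]
    return spec


def evict(cache, live_revisions):
    """Drop cached data for revisions no longer present in exp show."""
    for rev in list(cache):
        if rev not in live_revisions:
            del cache[rev]


# Illustrative input only; the revision names are placeholders.
baked = {
    "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
    "data": {
        "values": [
            {"step": 0, "rev": "workspace"},
            {"step": 0, "rev": "cc9db9e"},
            {"step": 1, "rev": "cc9db9e"},
        ]
    },
}
template, cache = split_revisions(baked)
spec = assemble(template, cache, ["cc9db9e"])
evict(cache, {"workspace"})
```

The split only works because every datapoint carries its `rev`; a template that stores data elsewhere would break it, which is the concern behind the first limitation below.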

Limitations (in order of priority):

  1. plots diff returns revision data baked into the template, which we then have to split out manually as per step 3 above. This is a hack at best, and we will not be able to rely on it when the extension starts accepting more than a handful of predefined templates.
  2. Revisions for running experiments that have been queued are not returned as expected**
  3. plots diff: duplicate revisions not returned dvc#7265

** When an experiment has been queued and is then running under "executor": "temp", the appropriate "live" data is available under .dvc/tmp/exps/<TEMP_DIR_NAME>/path/to/file as opposed to path/to/file. Until the experiment completes, plots diff returns the data for the parent revision. These two videos demonstrate what is shown in the extension when a repo (without checkpoints) that has "live" plots has an experiment running that was queued:

cc9db9e (running) matches b137fa8:

Screen.Recording.2022-01-31.at.5.22.41.pm.mov

cc9db9e completes and the final data is copied into the workspace:

Screen.Recording.2022-01-31.at.5.23.03.pm.mov
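
To make the path difference in the ** note concrete, here is a small Python sketch. It only mirrors the layout described above; the function name `live_path` and the temp directory name passed to it are illustrative assumptions.

```python
from pathlib import PurePosixPath


def live_path(relative_path, temp_dir_name=None):
    """Return where the "live" data for a plot file would be found.

    With no temp dir the file is read from the workspace; for a queued
    experiment running in a temp dir it sits under .dvc/tmp/exps/.
    """
    if temp_dir_name is None:
        return PurePosixPath(relative_path)
    return PurePosixPath(".dvc/tmp/exps") / temp_dir_name / relative_path


workspace_file = live_path("logs.csv")
queued_file = live_path("logs.csv", temp_dir_name="tmppsgp3kkh")
```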

Proposed solution:

I quickly talked to @pawel on this and this was the provisional idea that we came up with:

  1. Add an extra flag to plots show that provides only the half baked templates with a path to insert the data into the template (saves us scanning for an anchor).
  2. Have exp show return the plots data for each revision that it sends. This would greatly simplify the code on our end but also seems like the logical way to deal with the situation of an experiment running outside of the current workspace.
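
As a rough illustration of the first idea, the Python sketch below fills a half baked template using an explicit insertion path. The payload shape (`template` plus `data_path`) is purely hypothetical; no such flag or output format exists in DVC at the time of writing.

```python
import copy


def fill_template(template, data_path, datapoints):
    """Return a copy of the template with datapoints inserted at data_path."""
    spec = copy.deepcopy(template)
    target = spec
    # Walk (and create, if needed) every container up to the final key.
    for key in data_path[:-1]:
        target = target.setdefault(key, {})
    target[data_path[-1]] = datapoints
    return spec


# Hypothetical payload: a template without data plus the insertion path.
half_baked = {
    "template": {"data": {}, "mark": "line"},
    "data_path": ["data", "values"],
}
spec = fill_template(
    half_baked["template"],
    half_baked["data_path"],
    [{"step": 0, "rev": "workspace"}],
)
```

With an explicit path the consumer no longer has to scan the template for an anchor, so the approach would not depend on how any particular template is laid out.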

cc @efiop @dberenbaum

@mattseddon mattseddon added enhancement New feature or request discussion A: plots Area: plots webview, side panel and everything related A: integration Area: DVC integration layer labels Jan 31, 2022
@mattseddon
Member Author

Relates to #1256

@mattseddon
Member Author

mattseddon commented Feb 1, 2022

Some further info on trying to "live update" plots when running a non-checkpoint experiment in the workspace.

Running a 10 epoch experiment against example-dvc-experiments (non-checkpoints) in the workspace and then calling plots diff --show-json -o .dvc/tmp/plots yields this on the first epoch:

~/example-dvc-experiments @f5f308f5 !4 ❯ dvc plots diff -o .dvc/tmp/plots --show-json
DVC failed to load some plots for following revisions: 'workspace'.
{
  "logs.csv": [
    {
      "type": "vega",
      "revisions": [
        "workspace",
        "refs/exps/exec/EXEC_BASELINE"
      ],
      "content": {
        "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
        "data": {
          "values": [
            {
              " acc": " 0.8298666477203369",
              " loss": " 0.48006996512413025",
              " val_acc": " 0.8737999796867371",
              " val_loss": " 0.3617594540119171",
              "epoch": "1",
              "rev": "refs/exps/exec/EXEC_BASELINE",
              "step": 0
            },
            {
              " acc": " 0.8772833347320557",
              " loss": " 0.3410544991493225",
              " val_acc": " 0.8863999843597412",
              " val_loss": " 0.31630298495292664",
              "epoch": "2",
              "rev": "refs/exps/exec/EXEC_BASELINE",
              "step": 1
            },
            {
              " acc": " 0.8890666961669922",
              " loss": " 0.3060307502746582",
              " val_acc": " 0.8928999900817871",
              " val_loss": " 0.2947954833507538",
              "epoch": "3",
              "rev": "refs/exps/exec/EXEC_BASELINE",
              "step": 2
            },
            {
              " acc": " 0.8975833058357239",
              " loss": " 0.28065934777259827",
              " val_acc": " 0.8963000178337097",
              " val_loss": " 0.2771669328212738",
              "epoch": "4",
              "rev": "refs/exps/exec/EXEC_BASELINE",
              "step": 3
            },
            {
              " acc": " 0.9055333137512207",
              " loss": " 0.2595141530036926",
              " val_acc": " 0.9053000211715698",
              " val_loss": " 0.2615179717540741",
              "epoch": "5",
              "rev": "refs/exps/exec/EXEC_BASELINE",
              "step": 4
            },
            {
              " acc": " 0.9101333618164062",
              " loss": " 0.2426270693540573",
              " val_acc": " 0.9064000248908997",
              " val_loss": " 0.2575400173664093",
              "epoch": "6",
              "rev": "refs/exps/exec/EXEC_BASELINE",
              "step": 5
            },
            {
              " acc": " 0.9144333600997925",
              " loss": " 0.22980111837387085",
              " val_acc": " 0.9067999720573425",
              " val_loss": " 0.2509685158729553",
              "epoch": "7",
              "rev": "refs/exps/exec/EXEC_BASELINE",
              "step": 6
            },
            {
              " acc": " 0.9188666939735413",
              " loss": " 0.21584856510162354",
              " val_acc": " 0.9067000150680542",
              " val_loss": " 0.24992917478084564",
              "epoch": "8",
              "rev": "refs/exps/exec/EXEC_BASELINE",
              "step": 7
            },
            {
              " acc": " 0.9227333068847656",
              " loss": " 0.2055625468492508",
              " val_acc": " 0.9146000146865845",
              " val_loss": " 0.24114982783794403",
              "epoch": "9",
              "rev": "refs/exps/exec/EXEC_BASELINE",
              "step": 8
            },
            {
              " acc": " 0.9268500208854675",
              " loss": " 0.19567501544952393",
              " val_acc": " 0.9157000184059143",
              " val_loss": " 0.2405654340982437",
              "epoch": "10",
              "rev": "refs/exps/exec/EXEC_BASELINE",
              "step": 9
            }
          ]
        },
     ...template
    }
  ]
}

The output remains the same until the experiment is complete. I am not even sure where this experiment is being run because nothing is coming through any of the watchers. This would be another reason to have the plots data come through the exp show API.

edit: obviously this does not work because the project does not have a logger (e.g., dvclive) set up.
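
The mismatch in the output above (workspace is listed under revisions but has no datapoints) can be detected mechanically. The following Python sketch assumes only the JSON shape shown in that output; the function name `revisions_without_data` is hypothetical.

```python
def revisions_without_data(plots_output):
    """Map each plot path to the listed revisions that have no datapoints."""
    missing = {}
    for path, plots in plots_output.items():
        for plot in plots:
            if plot.get("type") != "vega":
                continue
            values = plot["content"].get("data", {}).get("values", [])
            seen = {row["rev"] for row in values}
            absent = [rev for rev in plot["revisions"] if rev not in seen]
            if absent:
                missing[path] = absent
    return missing


# Trimmed-down version of the output shown above.
output = {
    "logs.csv": [
        {
            "type": "vega",
            "revisions": ["workspace", "refs/exps/exec/EXEC_BASELINE"],
            "content": {
                "data": {
                    "values": [
                        {
                            "epoch": "1",
                            "rev": "refs/exps/exec/EXEC_BASELINE",
                            "step": 0,
                        }
                    ]
                }
            },
        }
    ]
}
missing = revisions_without_data(output)
print(missing)  # {'logs.csv': ['workspace']}
```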

@dberenbaum
Contributor

2. Revisions for running experiments that have been queued are not returned as expected**

If I understand, this issue is limited to a scenario where:

  1. The experiment has been queued or is otherwise run in a temporary directory.
  2. The experiment is still running.
  3. The experiment has plots that need to be updated live (training over multiple epochs and appending to the plots file in each epoch, which dvclive.next_step() does).
  4. The experiment is not using checkpoints (see Live plots without checkpoints #1256).

Is that correct?

@mattseddon
Member Author

mattseddon commented Feb 1, 2022

If I understand, this issue is limited to a scenario where:

  1. The experiment has been queued or is otherwise run in a temporary directory.
  2. The experiment is still running.
  3. The experiment has plots that need to be updated live (training over multiple epochs and appending to the plots file in each epoch, which dvclive.next_step() does).
  4. The experiment is not using checkpoints (see Live plots without checkpoints #1256).

Is that correct?

Yep, I think that is the only scenario where we could be getting updates but currently can't. The table shows all of the permutations that I can think of at the moment. ✅ = can get updated ❌ = not currently possible.

Run method Checkpoints No checkpoints w/logger No checkpoints no logger
workspace ✅ **
queue

Can you think of any scenarios that I've missed? I realise that the whole queuing system is experimental at the moment and will be worked on soon. Let's talk about this, as I think it would be very beneficial to nail down which scenarios are the most important.

** Tested by adding DvcLiveCallback to the example-dvc-experiments train script.

Screen.Recording.2022-02-02.at.9.48.33.am.mov

@dberenbaum
Contributor

dberenbaum commented Feb 1, 2022

Notes from our meeting:

  1. plots diff returns revision data baked into the template we then have to manually split as per 3. This is a hack at best and we will not be able to rely on it when the extension starts accepting more than a handful of predefined templates.

This is the highest priority at the moment, so let's focus on this for now.

2. Revisions for running experiments that have been queued are not returned as expected

This is the scenario in the table above for "queue + no checkpoints w/ logger." Since this isn't as high priority, we don't need to make any decision on it yet, but it's unclear to me if this is a blocker for initial release since it seems like a more advanced scenario.

@daavoo might have thoughts on both the importance and possible implementation for this.

@dberenbaum
Contributor

Some initial thoughts on live plots for queued/temp experiments without checkpoints. I don't think exp show will be any more helpful than plots since both are collecting information from git revisions. Maybe in the future when dvc needs to get status updates for experiments on remote executors, dvc will collect this level of detail about running experiments, but I don't think it's possible today. The only way I know to keep track of live updates to running experiments is to treat each .dvc/tmp/exps/<TEMP_DIR_NAME> like the workspace, watching for updates and running plots from inside each temp dir.

@mattseddon
Member Author

Maybe in the future when dvc needs to get status updates for experiments on remote executors, dvc will collect this level of detail about running experiments, but I don't think it's possible today. The only way I know to keep track of live updates to running experiments is to treat each .dvc/tmp/exps/<TEMP_DIR_NAME> like the workspace, watching for updates and running plots from inside each temp dir.

We do already get the updates from the temp directories coming through. We even call for a specific revision relating to the running experiment (e.g., dvc plots diff a43650e -o .dvc/tmp/plots --show-json). Where the process falls down is that plots diff has no idea what that revision is or where to find it. I can even see the data being generated under something like:

file:///example-dvc-experiments/.dvc/tmp/exps/tmppsgp3kkh/logs_dvc_plots/index.html

The bit of plumbing that is missing is the mapping of the temp directory to the revision. Once the running experiment finishes all of the data shows up in the workspace and the plots are updated "in bulk".

LMK if that doesn't make sense.

@dberenbaum
Contributor

The bit of plumbing that is missing is the mapping of the temp directory to the revision.

The temp dir info should be in .dvc/tmp/exps/run/<rev_num>/<rev_num>.run, so it should be possible to get the temp dir info there, cd into the temp dir, and run dvc plots.

@pmrowla How stable is this for finding the temp dir where an experiment is running? The VS Code team wants to run dvc plots on in-progress non-checkpoint experiments to get updates to dvclive plots. Any feedback or ideas would be appreciated!

@daavoo
Contributor

daavoo commented Feb 2, 2022

@daavoo might have thoughts on both the importance and possible implementation for this.

For me, workspace + no checkpoints w/ logger is the most relevant scenario as it covers the more consolidated/frequent workflow of DVC, running pipelines with dvc repro or plain dvc exp run.

I could not tell about queue + no checkpoints w/ logger because in my work I have always considered:

  • live tracking to be meaningful for computationally expensive (meaning, takes time to run) experiments.
  • parallel execution/scheduling (queueing) of these computationally expensive experiments to be a non-local thing (i.e. tasks are distributed to GPU server or cloud)

So, live tracking of locally queued experiments is a scenario I haven't really explored in practice.

@mattseddon
Member Author

The temp dir info should be in .dvc/tmp/exps/run/<rev_num>/<rev_num>.run, so it should be possible to get the temp dir info there, cd into the temp dir, and run dvc plots.

@dberenbaum we can read that file, process the JSON and use the information to cd as a temporary patch, but (seeing as it is relevant information) I would expect it to come through in the exp show output for the experiment. That way we don't have to rely on the underlying implementation.
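
A rough Python sketch of that temporary patch is below. It assumes the .run file is JSON at the location quoted above; the key holding the temp directory is not documented in this thread, so it is passed in as `temp_dir_key`, and the key name and paths in the demo are placeholders. The function only plans the call (working directory plus arguments) rather than running dvc.

```python
import json
import tempfile
from pathlib import Path


def run_file_path(dvc_root, rev):
    """Location of the run file, as quoted in this thread (not stable)."""
    return Path(dvc_root) / ".dvc" / "tmp" / "exps" / "run" / rev / f"{rev}.run"


def plan_plots_call(dvc_root, rev, temp_dir_key):
    """Return (cwd, argv) for running plots diff inside the temp dir."""
    info = json.loads(run_file_path(dvc_root, rev).read_text())
    cwd = Path(info[temp_dir_key])  # temp_dir_key is an assumption
    argv = ["dvc", "plots", "diff", "-o", ".dvc/tmp/plots", "--show-json"]
    return cwd, argv


# Demo against a throwaway directory with a fake run file.
with tempfile.TemporaryDirectory() as root:
    fake = run_file_path(root, "a43650e")
    fake.parent.mkdir(parents=True)
    fake.write_text(json.dumps({"cwd": "/repo/.dvc/tmp/exps/tmppsgp3kkh"}))
    cwd, argv = plan_plots_call(root, "a43650e", temp_dir_key="cwd")
```

The resulting pair could then be handed to something like `subprocess.run(argv, cwd=cwd)`. Because the file layout is an implementation detail, a sketch like this would need revisiting whenever that layout changes.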

@pmrowla

pmrowla commented Feb 3, 2022

@pmrowla How stable is this for finding the temp dir where an experiment is running? The VS Code team wants to run dvc plots on in-progress non-checkpoint experiments to get updates to dvclive plots. Any feedback or ideas would be appreciated!

This should not be considered stable right now, and the directory/file structure will probably continue to change in the near future, especially while the queueing work is ongoing.

But eventually, the idea is that yes, we will have some kind of serialized information where consumers can look up status info for what is running and where it's being run. So in theory at that point the VS Code extension could get the live plots data from the temp dir instead of needing it all to be fetched/collected by DVC into the main repo.

@dberenbaum
Contributor

I would expect it to come through in the exp show output for the experiment

Do you expect to have a way to find the location where the experiment is running, or do you expect exp show to include the plots data for each experiment? AFAIK the experiment location is in scope for what @pmrowla is doing but plots data for each experiment is not.

@mattseddon
Member Author

Do you expect to have a way to find the location where the experiment is running, or do you expect exp show to include the plots data for each experiment? AFAIK the experiment location is in scope for what @pmrowla is doing but plots data for each experiment is not.

I would be happy with the location as a short term solution.

I am unsure as to what the long term solution should be. I agree that having plots data in the exp show output would dramatically bloat it, and it is never going to provide any benefit to the CLI table (because where would it go?).

  • parallel execution/scheduling (queueing) of these computationally expensive experiments to be a non-local thing (i.e. tasks are distributed to GPU server or cloud)

Can I ask what would be expected for plots in terms of remote execution? My expectation would be that I could see live updates for multiple experiments running in the cloud. With that in mind maybe we would want to add a --plots flag to the exp show command because we don't want to be making multiple calls to a remote machine.

@dberenbaum
Contributor

Can I ask what would be expected for plots in terms of remote execution? My expectation would be that I could see live updates for multiple experiments running in the cloud. With that in mind maybe we would want to add a --plots flag to the exp show command because we don't want to be making multiple calls to a remote machine.

👍 That's a good question, and your proposal makes sense. I haven't put much thought into this yet. There's no expectation in any DVC proposals so far that users could see live plots updates for non-checkpoint experiments running in the cloud, and I would say we have much more basic problems to solve first for remote execution 😁 . In dvclive, there are discussions about how to provide regular notifications/updates: iterative/dvclive#90, which may be enough for users who want to keep tabs on remote experiments.

So, live tracking of locally queued experiments is a scenario I haven't really explored in practice.

👍 Queuing local experiments is more of a prerequisite for remote execution than a fully realized feature right now, but a typical workflow for me would have been to log in to a large cloud instance/cluster and run multiple experiments there in parallel. I think plenty of users are running dvc inside cloud instances rather than on their laptops, so "local" execution may not be limited to laptop scenarios. However, I'm not sure how well the DVC VS Code extension would work in that remote-ssh scenario 🤔 .

@mattseddon
Member Author

👍 Queuing local experiments is more of a prerequisite for remote execution than a fully realized feature right now, but a typical workflow for me would have been to log in to a large cloud instance/cluster and run multiple experiments there in parallel. I think plenty of users are running dvc inside cloud instances rather than on their laptops, so "local" execution may not be limited to laptop scenarios. However, I'm not sure how well the DVC VS Code extension would work in that remote-ssh scenario 🤔 .

This is something that VS Code does well: https://code.visualstudio.com/docs/remote/ssh. We should be able to piggyback on that behaviour 👍🏻 .

@mattseddon
Member Author

The bulk of the actionable items here have been covered.

In order to better support plots, we will now need to complete #1689, #1643, #1757 & #1117. Closing this now.


4 participants