experiments: parallel coordinates plot #4455

Closed
dmpetrov opened this issue Aug 24, 2020 · 12 comments · Fixed by #6933
Labels: A: experiments (Related to dvc exp), diff/show (Related to the diff/show feature), feature request (Requesting a new feature), p1-important (Important, aka current backlog of things to do)

Comments

@dmpetrov
Member

I'd love to see a parallel coordinates plot for an experiment.

$ dvc exp show --no-pager --include-metrics accuracy --include-params epochs,opt.lr --all-commits
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃ Experiment                        ┃ accuracy ┃ epochs ┃ opt.lr ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│ workspace                         │  0.48982 │ 2      │ 0.0    │
│ 794e714 (2020-08-24 03:01:44)     │  0.48982 │ 2      │ 0.0    │
│ ├── a0f6717 (2020-08-24 03:17:31) │  0.84703 │ 2      │ 1e-05  │
│ ├── 59b2759 (2020-08-24 03:12:42) │  0.86678 │ 2      │ 1e-05  │
│ ├── 26656e0 (2020-08-24 03:16:02) │  0.67587 │ 2      │ 0.0    │
│ └── 0a81605 (2020-08-24 03:16:42) │  0.81715 │ 2      │ 1e-05  │
│ 650049c (2020-08-21 20:43:31)     │  0.48607 │ 2      │ -      │
...
$ dvc exp plots 794e714 # show all ephemeral commits of a given commit as a  parallel coordinates plot 
file:///..../plot.png

$ dvc exp plots a0f6717 59b2759 ba39f4  # given set of ephemeral commits
file:///..../plot.png

$ dvc exp plots a0f6717 59b2759 ba39f4 --order epochs,opt.lr,accuracy  # change the order
file:///..../plot.png

image

@dmpetrov dmpetrov added the feature request Requesting a new feature label Aug 24, 2020
@efiop efiop added the p2-medium Medium priority, should be done, but less important label Aug 24, 2020
@pared
Contributor

pared commented Aug 25, 2020

Seems like it will involve creating a new template.

@pmrowla
Contributor

pmrowla commented Aug 28, 2020

See https://vega.github.io/vega-lite/examples/parallel_coordinate.html for example parallel coordinates plot schema

This will also require adding support for parameters in DVC plots. My understanding is that we cannot currently plot values from params.yaml, since it is not a DVC out.
@pared can you confirm this?

@pared
Contributor

pared commented Aug 28, 2020

@pmrowla That is correct. Currently we use find_outs_by_path to gather plot data. We should probably generalize it to accept any viable file.
Related: #4446

@pared
Contributor

pared commented Oct 27, 2020

NOTE: accepting any viable file is already implemented in #4446

@daavoo
Contributor

daavoo commented Sep 9, 2021

Would this belong in a separate plots command, or could it be an optional flag in dvc exp show?

It feels like parallel plots don't make much sense without experiments, and the implementation would probably be easier if we handled the generation as an optional flag of dvc exp show, like the existing --show-json and --show-csv. Something like --show-plot?

@daavoo daavoo self-assigned this Oct 8, 2021
@daavoo daavoo added diff/show Related to the diff/show feature p1-important Important, aka current backlog of things to do and removed p2-medium Medium priority, should be done, but less important labels Oct 14, 2021
@daavoo
Contributor

daavoo commented Oct 18, 2021

So after some research here is my idea on how to implement a minimum viable version:

Tabular datasets can and should be able to use this kind of plot (as shown in the example vega lite where raw tabular data is plotted).

For the case of experiments, the template would be used in dvc exp show to render data extracted from the internal TabularData (not clear how yet, i.e. calling directly dvc.render.write?).

The internal TabularData contains all the information required to fill the data anchor of the template (including metrics, params and revisions), and this would allow the existing dvc exp show options (e.g. --include-metrics) to apply to the parallel coordinates plot as well.

This plot doesn't fit the usual schema of anchors we use for other plots. For this kind of plot, we need anchors for: raw data; multiple columns to display (fold field in the linked template); property for encoding color (Species used in the example).

This feels kind of blocked by #5980 although there could be workarounds with the existing functionality.

  • Extending TabularData functions.

After trying different external tools for generating parallel coordinate plots, it looks like the best way would be to extend TabularData with some functions to clean the data before generating the plot. Many external tools apply the following functions to the data automatically:

  1. Dropping NaNs (need to decide whether to default to cols and/or rows). At least for rendering in vega, dropping feels like the easiest way to handle NaNs; other options seem tricky to implement in the vega template. See "TabularData: Add dropna method" (#6812).

  2. Dropping duplicates (need to decide whether to default to cols and/or rows). Implementing this in TabularData and exposing it as a CLI arg could fix "exp show: show only changed parameters" (#5966); it would be cols for that case.

  3. Dropping categorical (non-scalar) columns. Some tools (e.g. plotly) support, or have workarounds for, mixing scalar and categorical columns. Supporting this in a vega template feels tricky, so again dropping seems like the fastest way to a minimum viable version. TabularData does not currently store information about the original types, so it should also be extended to store the type associated with each column.
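
The three cleaning steps above can be sketched in plain Python on a list of row dicts. This is only an illustration, not the TabularData API: every function name here is hypothetical, "dropping duplicates (cols)" is interpreted as dropping columns whose value never changes, and "-" is assumed to be the placeholder for a missing value, as in the table at the top of the issue.

```python
import math


def is_missing(value):
    """Treat None, the '-' placeholder and NaN as missing (assumption)."""
    if value is None or value == "-":
        return True
    return isinstance(value, float) and math.isnan(value)


def drop_rows_with_missing(rows):
    """Step 1, 'rows' flavour: drop every row that has a missing value."""
    return [row for row in rows if not any(is_missing(v) for v in row.values())]


def drop_constant_columns(rows, keep=()):
    """Step 2, 'cols' flavour: drop columns whose value is the same in all rows."""
    if not rows:
        return []
    constant = {
        col
        for col in rows[0]
        if col not in keep and len({repr(row[col]) for row in rows}) == 1
    }
    return [{c: v for c, v in row.items() if c not in constant} for row in rows]


def is_scalar(value):
    """Numbers count as scalar; bools and everything else do not."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def drop_categorical_columns(rows, keep=()):
    """Step 3: drop columns that contain any non-scalar value."""
    if not rows:
        return []
    categorical = {
        col
        for col in rows[0]
        if col not in keep and not all(is_scalar(row[col]) for row in rows)
    }
    return [{c: v for c, v in row.items() if c not in categorical} for row in rows]


# A few rows copied from the `dvc exp show` table in the issue description.
rows = [
    {"Experiment": "a0f6717", "accuracy": 0.84703, "epochs": 2, "opt.lr": 1e-05},
    {"Experiment": "59b2759", "accuracy": 0.86678, "epochs": 2, "opt.lr": 1e-05},
    {"Experiment": "26656e0", "accuracy": 0.67587, "epochs": 2, "opt.lr": 0.0},
    {"Experiment": "650049c", "accuracy": 0.48607, "epochs": 2, "opt.lr": "-"},
]

cleaned = drop_rows_with_missing(rows)
cleaned = drop_constant_columns(cleaned, keep=("Experiment",))
cleaned = drop_categorical_columns(cleaned, keep=("Experiment",))
```

Here the 650049c row is dropped because opt.lr is missing, and epochs is dropped because it never changes. The `keep` argument protects the Experiment column, which would otherwise be removed as categorical.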

CC @pared @dberenbaum

@dberenbaum
Collaborator

What field will be used to determine line colors?

@daavoo
Contributor

daavoo commented Oct 19, 2021

What field will be used to determine line colors?

For the experiments use case, I was thinking of using the rev by default (i.e the experiment id)

@daavoo daavoo added this to DVC Oct 19, 2021
@daavoo daavoo moved this to Backlog in DVC Oct 19, 2021
@daavoo daavoo moved this from Backlog to Todo in DVC Oct 19, 2021
@dberenbaum
Collaborator

Do you plan to make it configurable? I'm guessing it could be helpful to users to color by one of the metrics.
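
To make the question concrete, here is a rough sketch of what a configurable color field could look like, assuming the plot ends up as a Plotly-style parcoords trace (a dict with `dimensions` and `line.color`). The function name and the `color_by` parameter are hypothetical and not part of DVC. A categorical field such as the rev is mapped to integer codes, because the line color array has to be numeric.

```python
def parcoords_trace(datapoints, columns, color_by):
    """Build a Plotly-style parcoords trace dict (hypothetical helper)."""
    values = [point[color_by] for point in datapoints]
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    if numeric:
        # A metric or param: use its values directly on the colorscale.
        color = values
    else:
        # A categorical field such as the rev: encode as first-seen index.
        categories = list(dict.fromkeys(values))
        color = [categories.index(v) for v in values]
    return {
        "type": "parcoords",
        "line": {"color": color},
        "dimensions": [
            {"label": col, "values": [point[col] for point in datapoints]}
            for col in columns
        ],
    }


# Values copied from the table in the issue description.
points = [
    {"rev": "a0f6717", "accuracy": 0.84703, "opt.lr": 1e-05},
    {"rev": "59b2759", "accuracy": 0.86678, "opt.lr": 1e-05},
    {"rev": "26656e0", "accuracy": 0.67587, "opt.lr": 0.0},
]

by_rev = parcoords_trace(points, ["accuracy", "opt.lr"], color_by="rev")
by_metric = parcoords_trace(points, ["accuracy", "opt.lr"], color_by="accuracy")
```

Coloring by rev gives each experiment its own color code, while coloring by a metric puts the lines on a continuous scale.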

@pared
Contributor

pared commented Oct 19, 2021

For the case of experiments, the template would be used in dvc exp show to render data extracted from the internal TabularData (not clear how yet, i.e. calling directly dvc.render.write?).

For initial implementation we could support parallel coordinates plot only for exp command, in that case we should be able to achieve it with internal dvc.render.write call.

This plot doesn't fit the usual schema of anchors we use for other plots. For this kind of plot, we need to have anchors for: raw data; multiple columns to display (fold field in the linked template); property for encoding color (Species used in the example).

We had a talk (with @daavoo) about the data, and it seems the best abstraction is to provide VegaRenderers with a list of datapoints. Vega templates require us to provide a list of datapoints, which we currently create from repo.plots.show data inside VegaRenderer. We could move this behavior outside of VegaRenderer: not only would it help with implementing new plots (the TabularData provided by exp is basically a CSV, which is a list of datapoints), it would also help Studio, which currently produces datapoints on its own.
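
As a minimal sketch of the "CSV is a list of datapoints" idea, assuming nothing about the actual VegaRenderer or TabularData internals: the helper names below are hypothetical, and the CSV text just mirrors two rows of the table in the issue description.

```python
import csv
import io

# Two rows copied from the `dvc exp show` table, written as CSV.
CSV_TEXT = """Experiment,accuracy,epochs,opt.lr
a0f6717,0.84703,2,1e-05
59b2759,0.86678,2,1e-05
"""


def coerce(value):
    """Turn numeric-looking strings into floats and leave the rest as-is."""
    try:
        return float(value)
    except ValueError:
        return value


def csv_to_datapoints(text, text_columns=("Experiment",)):
    """Parse CSV text into a list of dicts, one datapoint per row.

    Columns listed in text_columns are never coerced, so a revision hash
    that happens to look like a number stays a string.
    """
    reader = csv.DictReader(io.StringIO(text))
    return [
        {
            key: value if key in text_columns else coerce(value)
            for key, value in row.items()
        }
        for row in reader
    ]


datapoints = csv_to_datapoints(CSV_TEXT)
```

A renderer that accepts such a list would not need to know whether the data came from repo.plots.show, from the experiments table, or from somewhere else.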

@daavoo
Contributor

daavoo commented Oct 19, 2021

Not gonna lie, I have been trying out the vega-lite example, adapting it to be used as a template, and something that seems simple, like handling a scalar vs categorical column for coloring, ended up being not that simple (maybe my lack of experience with vega-lite is a bias here).

Even though iterative/dvc-render#7 might look like a deviation coming from nowhere, I actually think it could be considered a product prerequisite for the parallel coordinates plot (and at a good cost-opportunity tradeoff, IMO). Leaving aside my lack of vega-lite skills, it seems even more relevant when considering some ideas/requirements that were collected in the initial Studio proposal (mixing categorical and scalar data, column reordering, subset selection).

@pared
Contributor

pared commented Oct 20, 2021

@daavoo It also might be the fact that vega-lite was not designed for such operations:

Though Vega-Lite supports only one scale per axes, one can create a parallel coordinate plot by folding variables, using joinaggregate to normalize their values and using ticks and rules to manually create axes.

Maybe it would be easier to actually use pure vega, though I haven't been playing with it yet.
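
For reference, the fold-and-normalize approach described in the quote can be sketched outside of vega-lite in plain Python. This only illustrates the data transformation (fold columns into key/value records, then min-max normalize each key so all axes share one scale). It is not a vega or vega-lite spec, and the function name and the `norm_val` output field are just names picked for this sketch.

```python
def fold_and_normalize(datapoints, columns, id_field="rev"):
    """Fold wide datapoints into long form and min-max normalize per column."""
    # Per-column min and max, similar to a joinaggregate grouped by key.
    bounds = {
        col: (min(p[col] for p in datapoints), max(p[col] for p in datapoints))
        for col in columns
    }
    folded = []
    for point in datapoints:
        for col in columns:
            low, high = bounds[col]
            span = high - low
            # A constant column has no span; place it mid-axis (a choice
            # made for this sketch).
            norm = (point[col] - low) / span if span else 0.5
            folded.append(
                {
                    id_field: point[id_field],
                    "key": col,
                    "value": point[col],
                    "norm_val": norm,
                }
            )
    return folded


# Values copied from the table in the issue description.
points = [
    {"rev": "a0f6717", "accuracy": 0.84703, "opt.lr": 1e-05},
    {"rev": "59b2759", "accuracy": 0.86678, "opt.lr": 1e-05},
    {"rev": "26656e0", "accuracy": 0.67587, "opt.lr": 0.0},
]

folded = fold_and_normalize(points, ["accuracy", "opt.lr"])
```

Each experiment then becomes one line through the points (key, norm_val), which is why the real axes have to be drawn manually with ticks and rules.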

daavoo added a commit that referenced this issue Nov 15, 2021
New renderer based on plotly. Not exposed to `dvc plots`.
Generate plotly datapoints from `TabularData`.

pre-requisite #4455
daavoo added a commit that referenced this issue Nov 15, 2021
Uses `ParallelCoordinatesRenderer` and `dvc.render.html.write`.

pre-requisite #4455
daavoo added a commit that referenced this issue Nov 16, 2021
Uses `TabularData.to_parallel_coordinates`.
Adds new arguments: `html`, `color-by`, `out`, `open`

Closes #4455
@daavoo daavoo moved this from Review In Progress to Done in DVC Nov 16, 2021
@daavoo daavoo removed this from DVC Nov 16, 2021
daavoo added a commit that referenced this issue Nov 18, 2021
Uses `TabularData.to_parallel_coordinates`.
Adds new arguments: `html`, `color-by`, `out`, `open`

Closes #4455
daavoo added a commit that referenced this issue Nov 29, 2021
Uses `TabularData.to_parallel_coordinates`.
Adds new arguments: `html`, `color-by`, `out`, `open`

Closes #4455
daavoo added a commit that referenced this issue Dec 1, 2021
New renderer based on plotly. Not exposed to `dvc plots`.
Generate plotly datapoints from `TabularData`.

pre-requisite #4455
daavoo added a commit that referenced this issue Dec 1, 2021
Uses `ParallelCoordinatesRenderer` and `dvc.render.html.write`.

pre-requisite #4455
daavoo added a commit that referenced this issue Dec 1, 2021
Uses `TabularData.to_parallel_coordinates`.
Adds new arguments: `html`, `out`, `open`

Reuses `--sort-by` to define colorscale.

Closes #4455
daavoo added a commit that referenced this issue Dec 2, 2021
Uses `TabularData.to_parallel_coordinates`.
Adds new arguments: `html`, `out`, `open`

Reuses `--sort-by` to define colorscale.

Closes #4455
daavoo added a commit that referenced this issue Dec 15, 2021
New renderer based on plotly. Not exposed to `dvc plots`.
Generate plotly datapoints from `TabularData`.

pre-requisite #4455
daavoo added a commit that referenced this issue Dec 15, 2021
Uses `ParallelCoordinatesRenderer` and `dvc.render.html.write`.

pre-requisite #4455
daavoo added a commit that referenced this issue Dec 15, 2021
Uses `TabularData.to_parallel_coordinates`.
Adds new arguments: `html`, `out`, `open`

Reuses `--sort-by` to define colorscale.

Closes #4455
pmrowla pushed a commit that referenced this issue Dec 17, 2021
New renderer based on plotly. Not exposed to `dvc plots`.
Generate plotly datapoints from `TabularData`.

pre-requisite #4455
pmrowla pushed a commit that referenced this issue Dec 17, 2021
Uses `ParallelCoordinatesRenderer` and `dvc.render.html.write`.

pre-requisite #4455
pmrowla pushed a commit that referenced this issue Dec 17, 2021
Uses `TabularData.to_parallel_coordinates`.
Adds new arguments: `html`, `out`, `open`

Reuses `--sort-by` to define colorscale.

Closes #4455