Show csv format of experiments #6468

Merged
merged 11 commits into from
Sep 8, 2021

karajan1001
Contributor

@karajan1001 karajan1001 commented Aug 23, 2021

fix #5446

  1. add --show-csv to dvc exp show
  2. add tests for dvc exp show --show-csv

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

@karajan1001 karajan1001 requested a review from a team as a code owner August 23, 2021 01:59
@pmrowla
Contributor

pmrowla commented Aug 23, 2021

Is just dumping the exact exp show CLI table content into a CSV actually the desired behavior here? What we get right now is:

Experiment,Created,avg_prec,roc_auc,prepare.split,prepare.seed,featurize.max_features,featurize.ngrams,train.seed,train.n_est,train.min_split,max_features
workspace,-,0.60405,0.9608,0.2,20170428,3000,2,20170428,100,64,2500
bee447d,"Jun 02, 2021",0.67038,0.96693,0.2,20170428,3000,2,20170428,100,64,-
master,"May 29, 2021",0.60405,0.9608,0.2,20170428,3000,2,20170428,100,64,-
└── ac627ad [exp-44136],01:14 PM,0.60405,0.9608,0.2,20170428,3000,2,20170428,100,64,2500
cc51022,"May 28, 2021",0.55259,0.91536,0.2,20170428,1500,2,20170428,50,2,-

It seems like the merged Experiment column from the show table should actually be broken up into separate columns for a CSV formatted output (i.e. having separate columns for SHA, name, and parent instead of using the tree character + sha + [name] format used in the CLI output).

Also it seems like we should be outputting the ISO8601 timestamps for the created column, instead of the formatted output we use in the CLI table.

It also seems like we should not output the - for empty cells (in a CSV you can just output an empty cell, so in col1,,col3 the col2 value is empty).

Basically I think we need to be doing more than just using the existing TabularData and calling to_csv() on it. If we are writing to CSV, the data itself needs to be formatted differently than if we are outputting to the CLI. For comparison, the JSON output for one of these experiment rows is:

{
  "ac627ad7b1996289695199c6c0ffb74bb1a57c9f":{
    "data":{
      "timestamp":"2021-08-23T13:14:33",
      "params":{
        "params.yaml":{
          "data":{
            "prepare":{
              "split":0.2,
              "seed":20170428
            },
            "featurize":{
              "max_features":3000,
              "ngrams":2
            },
            "train":{
              "seed":20170428,
              "n_est":100,
              "min_split":64
            },
            "max_features":2500
          }
        }
      },
      "queued":false,
      "running":false,
      "executor":null,
      "metrics":{
        "scores.json":{
          "data":{
            "avg_prec":0.6040544652105823,
            "roc_auc":0.9608017142900953
          }
        }
      },
      "name":"exp-44136"
    }
  }
}

You can see that in the JSON, we use full git SHAs (not shortened ones), full precision for floating point numbers, and detailed timestamps. I think it should be the same if we are writing to CSV. For CSV output we should probably also always be including the full path for metrics/params files instead of dropping the default path segment in column headers.
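
A minimal sketch of what flattening that JSON into a CSV row could look like, assuming the `file:dotted.key` header style that appears later in this thread. The helper names `flatten` and `to_row` and the `sha`/`name`/`timestamp` column names are hypothetical and are not DVC's implementation; the input is an abridged copy of the JSON above.

```python
import csv
import io

# Abridged copy of the JSON shown above for one experiment.
exp = {
    "ac627ad7b1996289695199c6c0ffb74bb1a57c9f": {
        "data": {
            "timestamp": "2021-08-23T13:14:33",
            "params": {
                "params.yaml": {
                    "data": {
                        "prepare": {"split": 0.2, "seed": 20170428},
                        "max_features": 2500,
                    }
                }
            },
            "metrics": {
                "scores.json": {
                    "data": {
                        "avg_prec": 0.6040544652105823,
                        "roc_auc": 0.9608017142900953,
                    }
                }
            },
            "name": "exp-44136",
        }
    }
}


def flatten(prefix, value):
    """Yield (dotted_key, leaf_value) pairs from a nested dict."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten(f"{prefix}.{key}" if prefix else key, child)
    else:
        yield prefix, value


def to_row(sha, data):
    """Build one flat row: full SHA, name, timestamp, then file-prefixed values."""
    row = {"sha": sha, "name": data["name"], "timestamp": data["timestamp"]}
    for section in ("metrics", "params"):
        for path, content in data[section].items():
            for key, leaf in flatten("", content["data"]):
                row[f"{path}:{key}"] = leaf  # keep the full file path in the header
    return row


rows = [to_row(sha, item["data"]) for sha, item in exp.items()]
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()
print(csv_text)
```

Because the floats are written without rounding, a spreadsheet can still reduce the displayed precision later, which is not possible in the other direction.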

@karajan1001 @dberenbaum

dvc/command/experiments.py (outdated review thread, resolved)
karajan1001 and others added 2 commits August 23, 2021 17:09
Co-authored-by: Peter Rowlands (변기호) <[email protected]>
@dberenbaum
Collaborator

It seems like the merged Experiment column from the show table should actually be broken up into separate columns for a CSV formatted output (i.e. having separate columns for SHA, name, and parent instead of using the tree character + sha + [name] format used in the CLI output).

Can we just drop

_post_process_td(td)
?

Also it seems like we should be outputting the ISO8601 timestamps for the created column, instead of the formatted output we use in the CLI table.

I think it's okay as is and some users may prefer human readable dates. Many csv readers will either automatically handle date parsing or will provide an option to provide the date format.

It also seems like we should not output the - for empty cells (in a CSV you can just output an empty cell, so in col1,,col3 the col2 value is empty).

This would be nice if it's not too much trouble. It could also probably be handled pretty easily by the user if it is too messy to do on the dvc side.

Basically I think we need to be doing more than just using the existing TabularData and calling to_csv() on it. If we are writing to CSV, the data itself needs to be formatted differently than if we are outputting to the CLI.

I think the differences between the table and --show-json are actually confusing and not well documented today. Users might expect --show-csv output to look as much as possible like the table output, especially since it's a similar structure.

I agree that some users will probably want output more similar to --show-json since it's better structured and more informative, but it seems reasonable to start with the existing table output since it's much simpler and probably matches some users' expectations ("I want this table as a csv").

@pmrowla
Contributor

pmrowla commented Aug 24, 2021

Can we just drop

_post_process_td(td)

Yes

I think it's okay as is and some users may prefer human readable dates. Many csv readers will either automatically handle date parsing or will provide an option to provide the date format.

The thing is that these are full timestamps (date + time), and in the table we only display times for experiments that are from the current day, and dates for everything else (to match the default viewer behavior). This is fine for making the CLI table human readable, but it seems like it would still be more useful to have the date + time in the CSV output, even if we output it as human readable formatted "Month Day Year, hh:mm:ss" strings instead of an ISO8601 timestamp.
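
To illustrate the difference, here is a small sketch of the two renderings. The function names `cli_created` and `csv_created` are hypothetical, and the table format strings are inferred from the sample output earlier in this thread rather than taken from DVC's code.

```python
from datetime import datetime


def cli_created(ts, now):
    """Table-style rendering: time only for today's experiments, date otherwise."""
    if ts.date() == now.date():
        return ts.strftime("%I:%M %p")
    return ts.strftime("%b %d, %Y")


def csv_created(ts):
    """CSV-style rendering: always the full ISO 8601 date and time."""
    return ts.isoformat()


now = datetime(2021, 8, 23, 18, 0, 0)
today_exp = datetime(2021, 8, 23, 13, 14, 33)
older_commit = datetime(2021, 5, 29, 9, 30, 0)

print(cli_created(today_exp, now), "|", csv_created(today_exp))
print(cli_created(older_commit, now), "|", csv_created(older_commit))
```

The table form drops either the date or the time, while the ISO form can be parsed back into the original value with `datetime.fromisoformat`.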

This would be nice if it's not too much trouble. It could also probably be handled pretty easily by the user if it is too messy to do on the dvc side.

This can be done on the DVC side by changing the default empty data placeholder string (we want to use "" instead of "-")
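
As a rough sketch of that idea, the placeholder can be a parameter of the rendering step instead of a hardcoded value. `render_csv` and its `fill_value` argument are hypothetical names used for illustration, not DVC's TabularData API.

```python
import csv
import io


def render_csv(header, rows, fill_value=""):
    """Write rows as CSV, replacing missing cells (None) with fill_value."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    for row in rows:
        writer.writerow([fill_value if cell is None else cell for cell in row])
    return buf.getvalue()


header = ["col1", "col2", "col3"]
rows = [[1, None, 3]]

print(render_csv(header, rows))                  # empty cell for CSV consumers
print(render_csv(header, rows, fill_value="-"))  # table-style placeholder
```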

I agree that some users will probably want output more similar to --show-json since it's better structured and more informative, but it seems reasonable to start with the existing table output since it's much simpler and probably matches some users' expectations ("I want this table as a csv").

If this is really what users want then I don't have any objections to this PR, but I guess I don't really see the use case where users would want less detail in their CSV.

If users want to format columns in some particular way (string formatting for dates/float precision/etc) they can do it in their spreadsheet viewer. But especially with stuff like float precision - I can tell excel to only show 4 decimal places even if the original CSV data includes 10+ decimal places, but I cannot do it the other way around if DVC is only outputting the 4 decimal places to start with.

@dberenbaum
Collaborator

@pmrowla All of your points make sense. I'm not sure how much additional work each of those changes is. If they take significant additional effort, I think it's acceptable to leave them as is, since that keeps parity with the existing table output.

@skshetry
Member

TabularData is just a 2D data structure; it has nothing to do with the CLI. We should easily be able to skip merging headers for the row index, and skip formatting the timestamp and precision (the latter is already possible by passing precision=None). Also, the fill value is currently hardcoded; it can be changed to be passed down as an argument, so that we can specify different fill values for csv vs non-csv.

Regarding the headers, ideally, we should always be keeping the headers with full prefixes, and only normalize them just before rendering. It may be a bit more involved so we could skip it for now.

@karajan1001
Contributor Author

@pmrowla @dberenbaum for the precision problem, --precision works with --show-csv

@pmrowla
Contributor

pmrowla commented Aug 30, 2021

@pmrowla @dberenbaum for the precision problem, --precision works with --show-csv

Yes, but there is no way for users to specify full precision right now. It's currently set using

precision=self.args.precision or DEFAULT_PRECISION,

meaning that when --precision isn't used (and is None), it actually defaults to 5 instead of full precision.

This behavior is what we want for display in the CLI table by default, but not for a data format like CSV. It should default to full precision (meaning explicit precision=None) for --show-csv
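
A small sketch of the suggested behavior, assuming the default of 5 mentioned above. `resolve_precision` and `format_float` are hypothetical helpers written for illustration and are not part of DVC.

```python
DEFAULT_PRECISION = 5  # table default, as described in the comment above


def resolve_precision(requested, csv_output):
    """Pick the precision to use; None means full precision."""
    if requested is not None:
        return requested  # an explicit --precision always wins
    # `requested or DEFAULT_PRECISION` would also fall back for any falsy value
    # (None or 0), so the check above is written against None explicitly.
    return None if csv_output else DEFAULT_PRECISION


def format_float(value, precision):
    """Render a float either at full precision or rounded."""
    if precision is None:
        return repr(value)
    return str(round(value, precision))


value = 0.6040544652105823
print(format_float(value, resolve_precision(None, csv_output=False)))
print(format_float(value, resolve_precision(None, csv_output=True)))
print(format_float(value, resolve_precision(3, csv_output=True)))
```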

1. fill_values default to `` in csv format
2. precisions default to None in csv format
3. output timestamp for it
4. fix iterative#5989
5. add a new functional test for csv format
@karajan1001 karajan1001 marked this pull request as ready for review August 31, 2021 09:24
param_headers = _normalize_headers(param_names)

names = {**metric_names, **param_names}
counter = Counter(
Contributor Author

for #5989
The duplicated column name foo in the test stage would otherwise cause wrong output in the test.

Member

@skshetry skshetry Sep 6, 2021

@karajan1001, let's do the same for other headers that we have (like Experiments, rev, etc.) to reduce chances of collision. You can hoist the headers from the following and reuse them here:

headers = [
"Experiment",
"rev",
"typ",
"Created",
"parent",
"State",
"Executor",
]
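
One way to picture the collision handling discussed here is a Counter over all column names, including the fixed ones, with a file prefix added only to names that occur more than once. This is a sketch under that assumption; `normalize_headers` and `RESERVED` are hypothetical names and do not reproduce DVC's `_normalize_headers`.

```python
from collections import Counter

# Fixed columns from the list above; counting them lets a user-defined
# name such as "parent" be detected as a collision too.
RESERVED = ["Experiment", "rev", "typ", "Created", "parent", "State", "Executor"]


def normalize_headers(names_by_file, reserved=RESERVED):
    """Return column headers, prefixing a name with its file only on collision."""
    counter = Counter(reserved)
    for names in names_by_file.values():
        counter.update(names)

    headers = []
    for path, names in names_by_file.items():
        for name in names:
            headers.append(f"{path}:{name}" if counter[name] > 1 else name)
    return headers


columns = normalize_headers(
    {
        "metrics.yaml": ["foo"],
        "params.yaml": ["foo", "parent", "bar"],
    }
)
print(columns)
```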

Member

Also, should we rename the column typ to Type?

Member

@skshetry skshetry Sep 6, 2021

Also, should we capitalize the lowercase names?

Contributor Author

Also, should we capitalize the lowercase names?

For rev and parent I think we should capitalize them, but for the user-defined names, capitalization might confuse users and make the columns harder to manage in code (it is easy to capitalize a string but hard to recover the original).

Member

@karajan1001, I was only talking about that particular list: {typ, rev, parent}.

Contributor Author

@karajan1001 karajan1001 Sep 8, 2021

@skshetry
But there is one problem here: "rev" is consistent with some other functions, for example in dvc/repo/plots/template.py, ./dvc/scm/git/__init__.py and ./dvc/api.py,
while typ and parent are only used here. So I will only modify typ and parent here; still imperfect.

Member

But they are not part of a UI, they are mostly part of an API or a schema where that makes sense. Please check dvc metrics show --all-commits for example.

Contributor

@pmrowla pmrowla left a comment

LGTM, should have @skshetry and @dberenbaum double check

@karajan1001 karajan1001 requested a review from skshetry September 3, 2021 07:01
Collaborator

@dberenbaum dberenbaum left a comment

🎉

dvc/command/experiments.py (outdated review thread, resolved)
dvc/command/experiments.py (outdated review thread, resolved)
1. move iso format to front.
2. move several config logic out of `show_experiment`.
@karajan1001
Contributor Author

karajan1001 commented Sep 7, 2021

The current exp show output looks like this:

$ dvc exp show --show-csv
Experiment,rev,typ,Created,parent,avg_prec,roc_auc,prepare.split,prepare.seed,featurize.max_features,featurize.ngrams,train.seed,train.n_est,train.min_split
,workspace,baseline,,,0.5843640011189556,0.9544670443829399,0.2,20170428,3000,1,20170428,100,36
master,b05eecc,baseline,2021-08-02T16:48:14,,0.5325162867864254,0.9106964878520005,0.2,20170428,3000,1,20170428,100,2
exp-44136,ae99936,branch_commit,2021-08-31T14:56:55,,0.5843640011189556,0.9544670443829399,0.2,20170428,3000,1,20170428,100,36
exp-9fcef,8bc0b4d,branch_commit,2021-08-23T17:43:25,,0.5843640011189556,0.9544670443829399,0.2,20170428,3000,1,20170428,100,36
exp-aaa23,4d5e611,branch_commit,2021-08-23T17:43:17,,0.5950970297562502,0.9554043071921983,0.2,20170428,3000,1,20170428,100,72
exp-8ae22,567c4b8,branch_commit,2021-08-23T17:43:10,,0.6037301973188752,0.9557358950572323,0.2,20170428,3000,1,20170428,100,64
exp-a5f46,358e946,branch_base,2021-08-23T17:43:06,,0.576325443740785,0.955705227971449,0.2,20170428,3000,1,20170428,100,128

cap = capsys.readouterr()
assert (
"Experiment,rev,typ,Created,parent,State,scores.json:"
"featurize.max_features,scores.json:featurize.ngrams,"
Contributor Author

In this test, some parameters appear in two different files, so it can also be used to test #5989.

Contributor Author

Add a duplicated parent column in params.yaml

@karajan1001
Contributor Author

Current version:

Experiment,Rev,Type,Created,Parent,avg_prec,roc_auc,prepare.split,prepare.seed,featurize.max_features,featurize.ngrams,train.seed,train.n_est,train.min_split
,workspace,baseline,,,0.5843640011189556,0.9544670443829399,0.2,20170428,3000,1,20170428,100,36
master,b05eecc,baseline,2021-08-02T16:48:14,,0.5325162867864254,0.9106964878520005,0.2,20170428,3000,1,20170428,100,2
exp-44136,ae99936,branch_commit,2021-08-31T14:56:55,,0.5843640011189556,0.9544670443829399,0.2,20170428,3000,1,20170428,100,36
exp-9fcef,8bc0b4d,branch_commit,2021-08-23T17:43:25,,0.5843640011189556,0.9544670443829399,0.2,20170428,3000,1,20170428,100,36
exp-aaa23,4d5e611,branch_commit,2021-08-23T17:43:17,,0.5950970297562502,0.9554043071921983,0.2,20170428,3000,1,20170428,100,72
exp-8ae22,567c4b8,branch_commit,2021-08-23T17:43:10,,0.6037301973188752,0.9557358950572323,0.2,20170428,3000,1,20170428,100,64
exp-a5f46,358e946,branch_base,2021-08-23T17:43:06,,0.576325443740785,0.955705227971449,0.2,20170428,3000,1,20170428,100,128
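
For a consumer of this output, a short sketch of reading it back with the standard library follows. The sample is an abridged copy of the rows above (first seven columns only), and the parsing choices are illustrative assumptions rather than anything prescribed by DVC.

```python
import csv
import io
from datetime import datetime

# Abridged copy of the output above (first seven columns only).
sample = (
    "Experiment,Rev,Type,Created,Parent,avg_prec,roc_auc\r\n"
    ",workspace,baseline,,,0.5843640011189556,0.9544670443829399\r\n"
    "master,b05eecc,baseline,2021-08-02T16:48:14,,0.5325162867864254,0.9106964878520005\r\n"
    "exp-aaa23,4d5e611,branch_commit,2021-08-23T17:43:17,,0.5950970297562502,0.9554043071921983\r\n"
)

rows = list(csv.DictReader(io.StringIO(sample)))
for row in rows:
    # Empty cells come back as "", so they are easy to map to None.
    row["Created"] = datetime.fromisoformat(row["Created"]) if row["Created"] else None
    row["avg_prec"] = float(row["avg_prec"])

best = max(rows, key=lambda r: r["avg_prec"])
print(best["Rev"], best["avg_prec"])
```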

@@ -471,3 +492,46 @@ def test_show_with_broken_repo(tmp_dir, scm, dvc, exp_stage, caplog):

paths = ["workspace", "baseline", "error"]
assert isinstance(get_in(result, paths), YAMLFileCorruptedError)


def test_show_csv(tmp_dir, scm, dvc, exp_stage, capsys):
Member

@skshetry skshetry Sep 8, 2021

Do we need this test? Can this test be replaced with a mocked test that checks if show_experiments is being called correctly? WDYT? I don't have a strong opinion though.

Eg:

def test_experiments_show(dvc, scm, mocker):

Contributor Author

Agreed. In my previous version, the plan was to test only that _show_csv is called (you asked why I had this function), the value of all_experiments (already tested properly), and the to_csv function. But that left the code between all_experiments and to_csv in show_experiments untested. Now that show_experiments has been fully tested, we can just test the call from the interface to show_experiments.
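
The general shape of such an interface-level test can be sketched with `unittest.mock`. Everything here is a stand-in: `CmdShow`, its `render` argument, and the keyword names are hypothetical and do not mirror DVC's command classes or the signature of show_experiments.

```python
from types import SimpleNamespace
from unittest import mock


class CmdShow:
    """Hypothetical command object; not DVC's actual class."""

    def __init__(self, args, render):
        self.args = args
        self.render = render  # stand-in for the rendering function

    def run(self, all_experiments):
        # The command only forwards parsed flags; rendering is tested elsewhere.
        self.render(
            all_experiments,
            csv=self.args.csv,
            precision=self.args.precision,
        )
        return 0


def check_csv_flag_is_forwarded():
    render = mock.Mock()
    args = SimpleNamespace(csv=True, precision=None)
    experiments = {"workspace": {}}

    assert CmdShow(args, render).run(experiments) == 0
    render.assert_called_once_with(experiments, csv=True, precision=None)


check_csv_flag_is_forwarded()
```

A test of this kind only checks the wiring between the command line flags and the rendering call, so it does not depend on timestamps or experiment ordering.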

@karajan1001 karajan1001 merged commit 6106b13 into iterative:master Sep 8, 2021
@karajan1001 karajan1001 deleted the fix5446 branch September 8, 2021 08:59
@efiop
Contributor

efiop commented Sep 13, 2021

@karajan1001 Looks like this is a little flaky:

2021-09-13T19:02:32.4081032Z =================================== FAILURES ===================================
2021-09-13T19:02:32.4082061Z ________________________________ test_show_csv _________________________________
2021-09-13T19:02:32.4083946Z [gw0] linux -- Python 3.9.6 /opt/hostedtoolcache/Python/3.9.6/x64/bin/python
2021-09-13T19:02:32.4085135Z 
2021-09-13T19:02:32.4086660Z tmp_dir = PosixTmpDir('/tmp/pytest-of-runner/pytest-0/popen-gw0/test_show_csv0')
2021-09-13T19:02:32.4090688Z scm = Git: '/tmp/pytest-of-runner/pytest-0/popen-gw0/test_show_csv0/.git'
2021-09-13T19:02:32.4092684Z dvc = Repo: '/tmp/pytest-of-runner/pytest-0/popen-gw0/test_show_csv0'
2021-09-13T19:02:32.4093854Z exp_stage = Stage: 'copy-file'
2021-09-13T19:02:32.4094866Z capsys = <_pytest.capture.CaptureFixture object at 0x7f281995ef10>
2021-09-13T19:02:32.4095580Z 
2021-09-13T19:02:32.4096279Z     def test_show_csv(tmp_dir, scm, dvc, exp_stage, capsys):
2021-09-13T19:02:32.4097073Z         baseline_rev = scm.get_rev()
2021-09-13T19:02:32.4097695Z     
2021-09-13T19:02:32.4098326Z         def _get_rev_isotimestamp(rev):
2021-09-13T19:02:32.4099176Z             return datetime.fromtimestamp(
2021-09-13T19:02:32.4100403Z                 scm.gitpython.repo.rev_parse(rev).committed_date
2021-09-13T19:02:32.4101301Z             ).isoformat()
2021-09-13T19:02:32.4101872Z     
2021-09-13T19:02:32.4102739Z         result1 = dvc.experiments.run(exp_stage.addressing, params=["foo=2"])
2021-09-13T19:02:32.4103660Z         rev1 = first(result1)
2021-09-13T19:02:32.4104405Z         ref_info1 = first(exp_refs_by_rev(scm, rev1))
2021-09-13T19:02:32.4105425Z         result2 = dvc.experiments.run(exp_stage.addressing, params=["foo=3"])
2021-09-13T19:02:32.4106363Z         rev2 = first(result2)
2021-09-13T19:02:32.4107086Z         ref_info2 = first(exp_refs_by_rev(scm, rev2))
2021-09-13T19:02:32.4107756Z     
2021-09-13T19:02:32.4108382Z         capsys.readouterr()
2021-09-13T19:02:32.4109421Z         assert main(["exp", "show", "--show-csv"]) == 0
2021-09-13T19:02:32.4110204Z         cap = capsys.readouterr()
2021-09-13T19:02:32.4110886Z         assert (
2021-09-13T19:02:32.4112108Z             "Experiment,rev,typ,Created,parent,metrics.yaml:foo,params.yaml:foo"
2021-09-13T19:02:32.4115556Z             in cap.out
2021-09-13T19:02:32.4116708Z         )
2021-09-13T19:02:32.4117749Z         assert ",workspace,baseline,,,3,3" in cap.out
2021-09-13T19:02:32.4118815Z         assert (
2021-09-13T19:02:32.4119855Z             "master,***,baseline,***,,1,1".format(
2021-09-13T19:02:32.4121030Z                 baseline_rev[:7], _get_rev_isotimestamp(baseline_rev)
2021-09-13T19:02:32.4122034Z             )
2021-09-13T19:02:32.4122852Z             in cap.out
2021-09-13T19:02:32.4123651Z         )
2021-09-13T19:02:32.4124450Z >       assert (
2021-09-13T19:02:32.4125381Z             "***,***,branch_base,***,,2,2".format(
2021-09-13T19:02:32.4126502Z                 ref_info1.name, rev1[:7], _get_rev_isotimestamp(rev1)
2021-09-13T19:02:32.4127487Z             )
2021-09-13T19:02:32.4128288Z             in cap.out
2021-09-13T19:02:32.4129087Z         )
2021-09-13T19:02:32.4132373Z E       AssertionError: assert 'exp-42ced,b84cb50,branch_base,2021-09-13T19:01:43,,2,2' in 'Experiment,rev,typ,Created,parent,metrics.yaml:foo,params.yaml:foo\r\n,workspace,baseline,,,3,3\r\nmaster,9279735,bas...exp-42ced,b84cb50,branch_commit,2021-09-13T19:01:43,,2,2\r\nexp-e0d04,755d134,branch_base,2021-09-13T19:01:43,,3,3\r\n'
2021-09-13T19:02:32.4137095Z E        +  where 'exp-42ced,b84cb50,branch_base,2021-09-13T19:01:43,,2,2' = <built-in method format of str object at 0x7f2903a639e0>('exp-42ced', 'b84cb50', '2021-09-13T19:01:43')
2021-09-13T19:02:32.4139193Z E        +    where <built-in method format of str object at 0x7f2903a639e0> = '***,***,branch_base,***,,2,2'.format
2021-09-13T19:02:32.4141910Z E        +    and   'exp-42ced' = ExpRefInfo(baseline_sha='9279735db68b206e8725ba24c7b28fab99d3b93e', name='exp-42ced').name
2021-09-13T19:02:32.4144543Z E        +    and   '2021-09-13T19:01:43' = <function test_show_csv.<locals>._get_rev_isotimestamp at 0x7f2819b9fd30>('b84cb506452c41f40b30a22d8a43a3b547fa2d0c')
2021-09-13T19:02:32.4149434Z E        +  and   'Experiment,rev,typ,Created,parent,metrics.yaml:foo,params.yaml:foo\r\n,workspace,baseline,,,3,3\r\nmaster,9279735,bas...exp-42ced,b84cb50,branch_commit,2021-09-13T19:01:43,,2,2\r\nexp-e0d04,755d134,branch_base,2021-09-13T19:01:43,,3,3\r\n' = CaptureResult(out='Experiment,rev,typ,Created,parent,metrics.yaml:foo,params.yaml:foo\r\n,workspace,baseline,,,3,3\r\n...36mhttps://man.dvc.org/config#core\x1b[39m>\r                                                                      \r').out
2021-09-13T19:02:32.4152843Z 
2021-09-13T19:02:32.4153848Z /home/runner/work/dvc/dvc/tests/func/experiments/test_show.py:526: AssertionError

could you take a look, please?
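
In the log above, both experiment rows carry the same timestamp (2021-09-13T19:01:43), and exp-42ced is reported as branch_commit where the test expected branch_base. One possible reading, which is an assumption and not confirmed in this thread, is that the order of experiments with identical timestamps is not deterministic. Under that assumption, a sketch of an assertion that ignores the order-dependent column could look like this. The sample text contains only rows visible in the log, and `rows_without` is a hypothetical helper.

```python
import csv
import io

# Only the rows that are fully visible in the failure log above.
out = (
    "Experiment,rev,typ,Created,parent,metrics.yaml:foo,params.yaml:foo\r\n"
    ",workspace,baseline,,,3,3\r\n"
    "exp-42ced,b84cb50,branch_commit,2021-09-13T19:01:43,,2,2\r\n"
    "exp-e0d04,755d134,branch_base,2021-09-13T19:01:43,,3,3\r\n"
)


def rows_without(text, ignored=("typ",)):
    """Parse CSV text into dicts, dropping columns whose value depends on ordering."""
    reader = csv.DictReader(io.StringIO(text))
    return [{k: v for k, v in row.items() if k not in ignored} for row in reader]


rows = rows_without(out)
expected = {
    "Experiment": "exp-42ced",
    "rev": "b84cb50",
    "Created": "2021-09-13T19:01:43",
    "parent": "",
    "metrics.yaml:foo": "2",
    "params.yaml:foo": "2",
}
assert expected in rows  # holds whichever experiment ends up as branch_base
```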

@karajan1001 karajan1001 mentioned this pull request Sep 14, 2021
2 tasks
Development

Successfully merging this pull request may close these issues.

exp show csv/tsv option
5 participants