-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Show csv format of experiments #6468
Conversation
1. add --show-csv to dvc exp show 2. add tests for dvc exp show --show-csv
Is just dumping the exact
It seems like the merged Also it seems like we should be outputting the ISO8601 timestamps for the created column, instead of the formatted output we use in the CLI table. It also seems like we should just be not outputting the Basically I think we need to be doing more than just using the existing {
"ac627ad7b1996289695199c6c0ffb74bb1a57c9f":{
"data":{
"timestamp":"2021-08-23T13:14:33",
"params":{
"params.yaml":{
"data":{
"prepare":{
"split":0.2,
"seed":20170428
},
"featurize":{
"max_features":3000,
"ngrams":2
},
"train":{
"seed":20170428,
"n_est":100,
"min_split":64
},
"max_features":2500
}
}
},
"queued":false,
"running":false,
"executor":null,
"metrics":{
"scores.json":{
"data":{
"avg_prec":0.6040544652105823,
"roc_auc":0.9608017142900953
}
}
},
"name":"exp-44136"
}
}
} You can see that in the JSON, we use full git SHAs (not shortened ones), full precision for floating point numbers, and detailed timestamps. I think it should be the same if we are writing to CSV. For CSV output we should probably also always be including the full path for metrics/params files instead of dropping the default path segment in column headers. |
Co-authored-by: Peter Rowlands (λ³κΈ°νΈ) <[email protected]>
Can we just drop dvc/dvc/command/experiments.py Line 468 in bbbf15c
I think it's okay as is and some users may prefer human readable dates. Many csv readers will either automatically handle date parsing or will provide an option to provide the date format.
This would be nice if it's not too much trouble. It could also probably be handled pretty easily by the user if it is too messy to do on the dvc side.
I think the differences between the table and I agree that some users will probably want output more similar to |
Yes
The thing is that the timestamps are timestamps though (date + time) and in the table we only display times for experiments that are from the current day, and dates for everything else (to match the default viewer behavior). This is fine for making the CLI table human readable but it seems like, it would still be more useful to have the date + time in the CSV output, even if we output it as human readable formatted "Month Day Year, hh:mm:ss" strings instead of an ISO8601 timestamp.
This can be done on the DVC side by changing the default empty data placeholder string (we want to use
If this is really what users want then I don't have any objections to this PR, but I guess I don't really see the use case where users would want less detail in their CSV. If users want to format columns in some particular way (string formatting for dates/float precision/etc) they can do it in their spreadsheet viewer. But especially with stuff like float precision - I can tell excel to only show 4 decimal places even if the original CSV data includes 10+ decimal places, but I cannot do it the other way around if DVC is only outputting the 4 decimal places to start with. |
@pmrowla All of your points make sense. I'm not sure how much additional work each of those changes are. If they take significant additional effort, I think it's acceptable to leave them as is since it has parity with the existing table output. |
TabularData is just a 2D data structure, it has nothing to do with CLI. We should easily be able to skip merging headers for the row index, skip formatting timestamp, precision (this is already possible by passing Regarding the headers, ideally, we should always be keeping the headers with full prefixes, and only normalize them just before rendering. It may be a bit more involved so we could skip it for now. |
@pmrowla @dberenbaum for the precision problem |
Yes, but there is no way for users to specify full precision right now. It's currently set using precision=self.args.precision or DEFAULT_PRECISION, meaning that when This behavior is what we want for display in the CLI table by default, but not for a data format like CSV. It should default to full precision (meaning explicit |
1. fill_values default to `` in csv format 2. precisions default to None in csv format 3. output timestamp for it 4. fix iterative#5989 5. add a new functional test for csv format
dvc/command/experiments.py
Outdated
param_headers = _normalize_headers(param_names) | ||
|
||
names = {**metric_names, **param_names} | ||
counter = Counter( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for #5989
The duplicated column name foo
in the test stage will cause a wrong output in test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@karajan1001, let's do the same for other headers that we have (like Experiments
, rev
, etc.) to reduce chances of collision. You can hoist the headers from the following and reuse them here:
dvc/dvc/command/experiments.py
Lines 341 to 349 in b3680b1
headers = [ | |
"Experiment", | |
"rev", | |
"typ", | |
"Created", | |
"parent", | |
"State", | |
"Executor", | |
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, should we rename the column typ
to Type
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, the lowercase names, should we capitalize it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, the lowercase names, should we capitalize it?
for the revs and parent I think we should capitalize them, but for the user-defined ones, capitalization might cause confusion to the users, and make them hard to manage in code ( It is easy to capitalize a string but hard to recover it )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@karajan1001, I was only talking about that particular list: {typ, rev, parent}.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@skshetry
But one problem here, "rev" is consisitent with the some other functions for example in dvc/repo/plots/template.py
, ./dvc/scm/git/__init__.py
and ./dvc/api.py
while typ
and parent
are only used here. So I will only modify typ
and parent
here, still imperfect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But they are not part of a UI, they are mostly part of an API or a schema where that makes sense. Please check dvc metrics show --all-commits
for example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, should have @skshetry and @dberenbaum double check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
π
1. move iso format to front. 2. move several config logic out of `show_experiment`.
Current exp show like this
|
cap = capsys.readouterr() | ||
assert ( | ||
"Experiment,rev,typ,Created,parent,State,scores.json:" | ||
"featurize.max_features,scores.json:featurize.ngrams," |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this test, some arguments appear twice in different files, it can be used for test #5989
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a duplicated parent
column in params.yaml
Current edition Experiment,Rev,Type,Created,Parent,avg_prec,roc_auc,prepare.split,prepare.seed,featurize.max_features,featurize.ngrams,train.seed,train.n_est,train.min_split
,workspace,baseline,,,0.5843640011189556,0.9544670443829399,0.2,20170428,3000,1,20170428,100,36
master,b05eecc,baseline,2021-08-02T16:48:14,,0.5325162867864254,0.9106964878520005,0.2,20170428,3000,1,20170428,100,2
exp-44136,ae99936,branch_commit,2021-08-31T14:56:55,,0.5843640011189556,0.9544670443829399,0.2,20170428,3000,1,20170428,100,36
exp-9fcef,8bc0b4d,branch_commit,2021-08-23T17:43:25,,0.5843640011189556,0.9544670443829399,0.2,20170428,3000,1,20170428,100,36
exp-aaa23,4d5e611,branch_commit,2021-08-23T17:43:17,,0.5950970297562502,0.9554043071921983,0.2,20170428,3000,1,20170428,100,72
exp-8ae22,567c4b8,branch_commit,2021-08-23T17:43:10,,0.6037301973188752,0.9557358950572323,0.2,20170428,3000,1,20170428,100,64
exp-a5f46,358e946,branch_base,2021-08-23T17:43:06,,0.576325443740785,0.955705227971449,0.2,20170428,3000,1,20170428,100,128 |
@@ -471,3 +492,46 @@ def test_show_with_broken_repo(tmp_dir, scm, dvc, exp_stage, caplog): | |||
|
|||
paths = ["workspace", "baseline", "error"] | |||
assert isinstance(get_in(result, paths), YAMLFileCorruptedError) | |||
|
|||
|
|||
def test_show_csv(tmp_dir, scm, dvc, exp_stage, capsys): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this test? Can this test be replaced with a mocked test that checks if show_experiments
is being called correctly? WDYT? I don't have strong opinion though.
Eg:
def test_experiments_show(dvc, scm, mocker): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, In my previous version, what I planned is that we only test _show_csv
(you asked why I had this function) being called, the value of all_experiments
(already tested properly), and the to_csv
function. But I didn't test the code between all_experiments
and to_csv
in show_experiments
in it. And for now as the show_experiments
had been tested fully, we can just test the call from interface to the show_experiments
.
@karajan1001 Looks like this is a little flaky:
could you take a look, please? |
fix #5446
β I have followed the Contributing to DVC checklist.
π If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here. Add --show-csv flag to dvc exp showΒ dvc.org#2740
Thank you for the contribution - we'll try to review it as soon as possible. π