Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

params: allow tracking file #7432

Conversation

skshetry
Copy link
Member

@skshetry skshetry commented Mar 3, 2022

(will create the docs after #7413)

The syntax for tracking files is:

params:
 - params.yaml:
 - config.yaml:

Closes #4112

asciicast

@skshetry skshetry force-pushed the 4112-run-allow-tracking-of-all-params-in-a-file-with-`-p` branch 3 times, most recently from 586213b to 6edd526 Compare March 16, 2022 12:48
@skshetry skshetry force-pushed the 4112-run-allow-tracking-of-all-params-in-a-file-with-`-p` branch from 6edd526 to d761451 Compare March 16, 2022 13:16
@skshetry skshetry self-assigned this Mar 16, 2022
@skshetry skshetry requested a review from dberenbaum March 16, 2022 13:27
@skshetry skshetry added feature is a feature A: params Related to dvc params labels Mar 16, 2022
@skshetry skshetry marked this pull request as ready for review March 16, 2022 13:27
@skshetry skshetry requested a review from a team as a code owner March 16, 2022 13:27
@skshetry skshetry requested a review from karajan1001 March 16, 2022 13:27
Comment on lines +115 to +116
params = self.params or ldistinct([*actual.keys(), *info.keys()])
for param in params:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing to note here, this violates one of the principles on which dvc.yaml was built: that dvc.yaml is a single source of truth. Given that the feature that params provide is granular tracking, and given that we only have the filename of hyperparameters file in the dvc.yaml metadata, we don't know what we are supposed to be tracking without considering dvc.lock data.

Copy link
Collaborator

@dberenbaum dberenbaum Mar 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than violate this principle, I'd prefer to just keep dvc status showing changes at the file level. Again, I think it's similar to how param keys are handled today. If I have a stage with param foo and modify the value of foo.bar, dvc status shows foo is modified, not foo.bar (even though we could get foo.bar from dvc.lock). Similarly, I think for tracking params.yaml:, dvc status should show params.yaml is modified, not any individual parameter key.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did consider only showing modified on the file-level itself, but I think granular tracking of params to be an important feature that makes it distinct from other dependencies/outputs.

I did this only in status because status is not a static operation, but a very active operation that inspects the user's workspace. So it already has a context of what things are in the user's hyperparameters, so it seemed fine to also look into the lockfile (that helps us understand what was actually tracked previously so that we can provide new/deleted statuses which we kind of already did).
So, I consider violating that to be a minor issue here in the status.

dvc status should show params.yaml is modified, not any individual parameter key.

But again, I don't have a strong opinion here. Let me know if you feel that strongly here, should be a straightforward change.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that helps us understand what was actually tracked previously so that we can provide new/deleted statuses which we kind of already did

Yeah, now that I think about it more, I'm not that clear on how this is different from what's already happening, which seems to be comparing parameters in dvc.yaml to dvc.lock.

But again, I don't have a strong opinion here. Let me know if you feel that strongly here, should be a straightforward change.

I would opt for whatever is the simplest and closest to the current implementation. We are about to do much larger changes in status soon, including separating data and stage status. We can revisit then when we have a clearer picture of what we want to do with this info.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, now that I think about it more, I'm not that clear on how this is different from what's already happening, which seems to be comparing parameters in dvc.yaml to dvc.lock.

yes we already did that, but here in this case we don't know what parameters we are supposed to be comparing without actually looking into hyperparameters file and dvc.lock combined. As I said, it's a very minor issue and not very relevant in the status operation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would opt for whatever is the simplest and closest to the current implementation. We are about to do much larger changes in status soon, including separating data and stage status. We can revisit then when we have a clearer picture of what we want to do with this info.

As we have this already implemented, we can just go with what we have. Like you said, we can revisit this without any issues in the future.

for param in self.params:
if param in values:
info[param] = values[param]
self.hash_info = HashInfo(self.PARAM_PARAMS, info)

def read_params(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

read_params_d and read_params was merged into one here.

Comment on lines +43 to 45
info = (
{self.PARAM_PARAMS: params} if isinstance(params, dict) else None
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we are trying to detect difference between the absence of lock entry vs lock entry tracking empty contents.
So, we use None for absence of lock-entry and {} (empty dict) for empty contents.

Relevant test here: https://github.com/iterative/dvc/pull/7432/files#diff-f608e1da524603a471079fc0daa1b288fa3e5e1928d50ed346a365d9f7771f82R159

@skshetry skshetry requested review from dtrifiro and daavoo March 16, 2022 13:43
Comment on lines 183 to 185
if not self.isfile and not self.isdir:
raise self.IsNotFileOrDirError(self)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The isfile and isdir are not properties, they are methods. I'll fix this in a separate PR.

Comment on lines +51 to +72
# figure out completely tracked params file, and ignore specific keys
wholly_tracked = set()
for key in s_list:
if not isinstance(key, dict):
continue
wholly_tracked.update(k for k, params in key.items() if not params)

for key in s_list:
if isinstance(key, str):
d[default_file].append(key)
if default_file not in wholly_tracked:
d[default_file].append(key)
continue

if not isinstance(key, dict):
msg = "Only list of str/dict is supported. Got: "
msg += f"'{type(key).__name__}'."
raise ValueError(msg)

for k, params in key.items():
if k in wholly_tracked:
d[k] = []
continue
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What we are trying to do here is that we are ignoring any specific paths if None is already specified.
Eg:

params:
 - foo
 - bar
 - params.yaml:
 - config.yaml:
    - foo
    - bar
 - config.yaml:
 - config2.yaml:
    - foo
    - bar
 - config2.yaml:
   - lorem
   - ipsum

The specific targets for file with None specified are ignored, and paths for files with the same name are merged together to create:

{"params.yaml": [], "config.yaml": [], "config2.yaml": ["foo", "bar", "lorem", "ipsum"]}

Note that empty list or None denotes that there are no specific targets and the file has to be tracked in whole.

@skshetry skshetry force-pushed the 4112-run-allow-tracking-of-all-params-in-a-file-with-`-p` branch from d761451 to b08a3e6 Compare March 20, 2022 16:16
@skshetry skshetry force-pushed the 4112-run-allow-tracking-of-all-params-in-a-file-with-`-p` branch from b08a3e6 to 6c1fb58 Compare March 21, 2022 07:27

def dumpd(self):
ret = super().dumpd()
if not self.hash_info:
ret[self.PARAM_PARAMS] = self.params
ret[self.PARAM_PARAMS] = self.params or {}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This piece of code is questionable, especially for .dvc files with params. But since the params support in .dvc files are undocumented and does not trigger this piece of code, I am keeping this as-is for simplicity. πŸ™ˆ

@skshetry skshetry merged commit a3bab48 into iterative:main Mar 21, 2022
@skshetry skshetry deleted the 4112-run-allow-tracking-of-all-params-in-a-file-with-`-p` branch March 21, 2022 08:02
@dberenbaum
Copy link
Collaborator

@skshetry πŸ”₯ Could you create the docs ticket now?

daavoo added a commit to iterative/dvcyaml-schema that referenced this pull request Apr 1, 2022
daavoo added a commit to iterative/dvcyaml-schema that referenced this pull request Apr 4, 2022
@skshetry skshetry restored the 4112-run-allow-tracking-of-all-params-in-a-file-with-`-p` branch April 27, 2022 03:56
@skshetry skshetry deleted the 4112-run-allow-tracking-of-all-params-in-a-file-with-`-p` branch May 3, 2022 09:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: params Related to dvc params feature is a feature
Projects
No open projects
Archived in project
Development

Successfully merging this pull request may close these issues.

run: allow tracking of all params in a file with -p
2 participants