-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
params: allow tracking file #7432
params: allow tracking file #7432
Conversation
586213b
to
6edd526
Compare
6edd526
to
d761451
Compare
params = self.params or ldistinct([*actual.keys(), *info.keys()]) | ||
for param in params: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing to note here, this violates one of the principles on which dvc.yaml
was built: that dvc.yaml
is a single source of truth. Given that the feature that params provide is granular tracking, and given that we only have the filename of hyperparameters file in the dvc.yaml
metadata, we don't know what we are supposed to be tracking without considering dvc.lock
data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than violate this principle, I'd prefer to just keep dvc status
showing changes at the file level. Again, I think it's similar to how param keys are handled today. If I have a stage with param foo
and modify the value of foo.bar
, dvc status
shows foo
is modified, not foo.bar
(even though we could get foo.bar
from dvc.lock
). Similarly, I think for tracking params.yaml:
, dvc status
should show params.yaml
is modified, not any individual parameter key.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did consider only showing modified
on the file-level itself, but I think granular tracking of params to be an important feature that makes it distinct from other dependencies/outputs.
I did this only in status
because status
is not a static operation, but a very active operation that inspects the user's workspace. So it already has a context of what things are in the user's hyperparameters, so it seemed fine to also look into the lockfile (that helps us understand what was actually tracked previously so that we can provide new
/deleted
statuses which we kind of already did).
So, I consider violating that to be a minor issue here in the status
.
dvc status should show params.yaml is modified, not any individual parameter key.
But again, I don't have a strong opinion here. Let me know if you feel that strongly here, should be a straightforward change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that helps us understand what was actually tracked previously so that we can provide
new
/deleted
statuses which we kind of already did
Yeah, now that I think about it more, I'm not that clear on how this is different from what's already happening, which seems to be comparing parameters in dvc.yaml
to dvc.lock
.
But again, I don't have a strong opinion here. Let me know if you feel that strongly here, should be a straightforward change.
I would opt for whatever is the simplest and closest to the current implementation. We are about to do much larger changes in status
soon, including separating data and stage status. We can revisit then when we have a clearer picture of what we want to do with this info.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, now that I think about it more, I'm not that clear on how this is different from what's already happening, which seems to be comparing parameters in
dvc.yaml
todvc.lock
.
yes we already did that, but here in this case we don't know what parameters we are supposed to be comparing without actually looking into hyperparameters file and dvc.lock
combined. As I said, it's a very minor issue and not very relevant in the status
operation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would opt for whatever is the simplest and closest to the current implementation. We are about to do much larger changes in
status
soon, including separating data and stage status. We can revisit then when we have a clearer picture of what we want to do with this info.
As we have this already implemented, we can just go with what we have. Like you said, we can revisit this without any issues in the future.
for param in self.params: | ||
if param in values: | ||
info[param] = values[param] | ||
self.hash_info = HashInfo(self.PARAM_PARAMS, info) | ||
|
||
def read_params( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
read_params_d
and read_params
was merged into one here.
info = ( | ||
{self.PARAM_PARAMS: params} if isinstance(params, dict) else None | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we are trying to detect difference between the absence of lock entry vs lock entry tracking empty contents.
So, we use None
for absence of lock-entry and {}
(empty dict) for empty contents.
Relevant test here: https://github.com/iterative/dvc/pull/7432/files#diff-f608e1da524603a471079fc0daa1b288fa3e5e1928d50ed346a365d9f7771f82R159
if not self.isfile and not self.isdir: | ||
raise self.IsNotFileOrDirError(self) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The isfile
and isdir
are not properties, they are methods. I'll fix this in a separate PR.
# figure out completely tracked params file, and ignore specific keys | ||
wholly_tracked = set() | ||
for key in s_list: | ||
if not isinstance(key, dict): | ||
continue | ||
wholly_tracked.update(k for k, params in key.items() if not params) | ||
|
||
for key in s_list: | ||
if isinstance(key, str): | ||
d[default_file].append(key) | ||
if default_file not in wholly_tracked: | ||
d[default_file].append(key) | ||
continue | ||
|
||
if not isinstance(key, dict): | ||
msg = "Only list of str/dict is supported. Got: " | ||
msg += f"'{type(key).__name__}'." | ||
raise ValueError(msg) | ||
|
||
for k, params in key.items(): | ||
if k in wholly_tracked: | ||
d[k] = [] | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What we are trying to do here is that we are ignoring any specific paths if None
is already specified.
Eg:
params:
- foo
- bar
- params.yaml:
- config.yaml:
- foo
- bar
- config.yaml:
- config2.yaml:
- foo
- bar
- config2.yaml:
- lorem
- ipsum
The specific targets for file with None
specified are ignored, and paths for files with the same name are merged together to create:
{"params.yaml": [], "config.yaml": [], "config2.yaml": ["foo", "bar", "lorem", "ipsum"]}
Note that empty list or None
denotes that there are no specific targets and the file has to be tracked in whole.
d761451
to
b08a3e6
Compare
b08a3e6
to
6c1fb58
Compare
|
||
def dumpd(self): | ||
ret = super().dumpd() | ||
if not self.hash_info: | ||
ret[self.PARAM_PARAMS] = self.params | ||
ret[self.PARAM_PARAMS] = self.params or {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This piece of code is questionable, especially for .dvc
files with params
. But since the params
support in .dvc
files are undocumented and does not trigger this piece of code, I am keeping this as-is for simplicity. π
@skshetry π₯ Could you create the docs ticket now? |
β I have followed the Contributing to DVC checklist.
π If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.
(will create the docs after #7413)
The syntax for tracking files is:
Closes #4112