logger: extend with log_param #100

Closed
daavoo opened this issue Jun 21, 2021 · 8 comments · Fixed by #292

@daavoo
Contributor

daavoo commented Jun 21, 2021

Motivation

Other popular ML Loggers have methods for logging and tracking hyperparameters.

These methods are useful when building integrations with ML Frameworks, as they are commonly used to automatically log and track internal configuration.

When using dvclive as an alternative to those ML Loggers, that information is being lost.

Proposal

It could be useful to extend dvclive with a new log_param method in order to cover these needs.

I think that dvclive.log_param could just write the values to an output file in YAML format (e.g. dvclive.yml by default), as sketched below. This would keep a good parallel with the existing outputs of dvclive and how they integrate with dvc features:

  • dvclive/*.tsv -> dvc plots
  • dvclive.json -> dvc metrics
  • dvclive.yml -> dvc params
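
A minimal sketch of what such a method might look like (the file name, module-level state, and PyYAML usage are assumptions for illustration, not the actual implementation):

```python
import yaml  # PyYAML

# Hypothetical sketch of the proposed log_param; names and file layout
# are assumptions, not the final dvclive implementation.
_PARAMS_FILE = "dvclive.yml"
_params = {}


def log_param(name, value):
    """Record one hyperparameter and rewrite the YAML output file."""
    _params[name] = value
    with open(_PARAMS_FILE, "w") as f:
        yaml.safe_dump(_params, f)
```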

In addition to that, existing and future integrations of dvclive with ML Frameworks would reach closer feature parity with other ML Logger alternatives.

@dberenbaum
Collaborator

> When using dvclive as an alternative to these ML Loggers, the user loses all that information.

Hyperparameters can be tracked in params.yaml or another file. In the general DVC workflow, hyperparameters are read from a file into the model training code rather than being written from the model training code to a file. This seems like good practice that encourages separation between parameters/configuration and code, and allows for the hyperparameters to be tracked as a dependency of the code.

dvc params or dvc exp show or similar commands do not require any kind of explicit parameter logging methods. One advantage of using the DVC ecosystem is that it encourages this separation and maintains hyperparameters in this simple file structure, requiring fewer manual logging methods in the code itself.
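
For illustration, the typical pattern is only a few lines (the file contents and param names here are arbitrary examples):

```python
import yaml  # PyYAML

# params.yaml is an input/dependency of the stage; the training code
# only reads it. Example contents:
#   train:
#     learning_rate: 0.001
#     epochs: 10
with open("params.yaml") as f:
    params = yaml.safe_load(f)

learning_rate = params["train"]["learning_rate"]  # passed to the optimizer
```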

Do you think this is enough? Is there some additional benefit to adding a log_param method? Would it be helpful to have a read_param method instead to reduce boilerplate code for parsing the parameters file?

@daavoo
Contributor Author

daavoo commented Jun 22, 2021

I see your points and totally agree with encouraging the separation between params and code. However, I see a few "counterpoints":

> When using dvclive as an alternative to these ML Loggers, the user loses all that information.
>
> Hyperparameters can be tracked in params.yaml or another file. In the general DVC workflow, hyperparameters are read from a file into the model training code rather than being written from the model training code to a file. This seems like good practice that encourages separation between parameters/configuration and code, and allows for the hyperparameters to be tracked as a dependency of the code.

As far as I know, dvclive is intended to integrate seamlessly with DVC without being a strict requirement itself. Given that, I feel that a log_param method could have value on its own for using dvclive in "standalone" mode.

> dvc params or dvc exp show or similar commands do not require any kind of explicit parameter logging methods. One advantage of using the DVC ecosystem is that it encourages this separation and maintains hyperparameters in this simple file structure, requiring fewer manual logging methods in the code itself.

I think that current users of DVC (this is the case for our team) could be using params.yml for tracking the configuration of "data processing" stages and some hyperparameters in the "training stage".

However, in the "train stage" we use an ML Framework + ML Logger integration that logs additional information to the tracking server. This is the information I referred to as "being lost".

If we replaced our current ML Logger with dvclive, we would need quite a few modifications to the "train stage" code (manually reading from params and passing them as args to the ML Framework) in order to match all the params that are currently logged/tracked automatically.

> Do you think this is enough? Is there some additional benefit to adding a log_param method?

In the context of using log_param as part of an ML Framework + ML Logger integration, I think it won't result in more "manual logging"; on the contrary, users will get more params tracked just by choosing dvclive as their ML Logger.

In addition, I see a specific use case where log_param could cover a need that (afaik) can't be covered with the existing dvc params functionality: when using a "hyperparameter optimization framework" (e.g. KerasTuner), some of the hyperparameters are selected and set within the code. In this context, there is no way of using dvc to track the params/metrics associated with each individual experiment/trial. If we add log_param to dvclive, we could use the existing checkpoint functionality:

```python
import dvclive
import keras_tuner as kt


class MyTuner(kt.Tuner):

    def run_trial(self, trial, *args, **kwargs):
        # log the hyperparameter values selected for this trial
        for k, v in trial.hyperparameters.values.items():
            dvclive.log_param(k, v)  # the proposed method

        # ... build, train, and evaluate the model for this trial ...
        metrics = ...  # placeholder: dict of metric names -> values
        for k, v in metrics.items():
            dvclive.log(k, v)

        # create a checkpoint with the params and metrics for this "trial"
        dvclive.next_step()
```

> Would it be helpful to have a read_param method instead to reduce boilerplate code for parsing the parameters file?

I think that would be quite helpful, although it'd probably be better discussed in a separate dvc issue, right?
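
Just to make it concrete, a purely hypothetical read_param could look something like this (not an existing dvclive or dvc API; the names and dotted-key convention are illustrative):

```python
import yaml  # PyYAML


def read_param(name, params_file="params.yaml"):
    """Return one value from the params file, supporting dotted keys."""
    with open(params_file) as f:
        params = yaml.safe_load(f)
    value = params
    for key in name.split("."):
        value = value[key]
    return value


learning_rate = read_param("train.learning_rate")  # vs. manual parsing
```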

Regardless of reducing boilerplate, there are scenarios where I don't see how replacing manual logging methods with manual reading and passing would be better.

@dberenbaum
Collaborator

> However, in the "train stage" we use an ML Framework + ML Logger integration that logs additional information to the tracking server. This is the information I referred to as "being lost".
>
> If we replaced our current ML Logger with dvclive, we would need quite a few modifications to the "train stage" code (manually reading from params and passing them as args to the ML Framework) in order to match all the params that are currently logged/tracked automatically.

Okay, so it's about extracting metadata from the model object, right? If the model is saved, this is usually going to be available, I think, but extracting that metadata to a human-readable file makes sense.

We might need to think through whether this is the best way to capture that info so that it integrates well with dvc. It seems pretty different from params.yaml since params.yaml is an input/dependency and these params are really an output of the code (although maybe they are an input to the model fit method). Also, unlike log_metric, it is summary info that doesn't belong at the step level (I think you already implied this point with your suggested file structure).

> When using a "hyperparameter optimization framework" (e.g. KerasTuner), some of the hyperparameters are selected and set within the code.

Support for hyperparameter searches is a great point, and I like the proposal, although that might be a bit off-topic, and might need to start with dvc before worrying about dvclive support. I was just thinking about this earlier in iterative/dvc#6194 (comment).

@dmpetrov
Member

> Okay, so it's about extracting metadata from the model object, right? If the model is saved, this is usually going to be available, I think, but extracting that metadata to a human-readable file makes sense.

Reporting model object metadata is a good motivation for log_param()! Otherwise, params.yaml seems like the better practice. In most projects, people still use params files but have to do double work and report log_param() just because the logger framework pushes them to it.

What are examples of model object metadata that would need params files, not metrics?

@dberenbaum
Collaborator

> What are examples of model object metadata that would need params files, not metrics?

I'd be interested to hear what @daavoo thinks. In the past, I have logged whatever I can about the model object, which mostly included implicit parameters like default values that I didn't specify. These mostly weren't that useful in my case since they didn't change between experiments, but it was still occasionally nice to be able to know what those parameter values were without having to load the model to extract them.

@daavoo added the discussion (requires active participation to reach a conclusion) and research labels and removed the enhancement label on Jul 12, 2021
@daavoo
Contributor Author

daavoo commented Jul 19, 2021

For reference, review existing ML Logger integrations with Optuna (#118), which is a Hyperparameter Optimization Framework.

It looks like log_param could be more useful when integrating with Hyperparameter Optimization Frameworks than when integrating with ML Frameworks.

@daavoo
Contributor Author

daavoo commented Jul 19, 2021

> I'd be interested to hear what @daavoo thinks. In the past, I have logged whatever I can about the model object, which mostly included implicit parameters like default values that I didn't specify. These mostly weren't that useful in my case since they didn't change between experiments, but it was still occasionally nice to be able to know what those parameter values were without having to load the model to extract them.

On many occasions I have found the params automatically logged by other ML Loggers quite useful, especially when collaborating with other people to train the same model. I have even found myself not caring about a param until I read a paper that showcases how critical it could be.

For example (a toy example to make the point), the mlflow<>keras integration automatically logs the value of the learning_rate (with log_param). Maybe one person on the team is really specialized in model architectures and doesn't really care about that parameter, but a collaborator who is really into optimizers could review previous experiments and know the values used. Potentially, the collaborator could then launch an experiment themselves, updating that learning_rate value.
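
For reference, enabling that behavior on the user side is roughly a one-liner (assuming MLflow's autologging support for Keras; the actual fit call is elided here):

```python
import mlflow

# Autologging hooks into supported frameworks (including Keras) and
# records optimizer settings such as learning_rate via log_param.
mlflow.autolog()

# model.fit(...)  # params are then tracked without extra logging code
```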

In the context of dvclive and its relation to dvc params, I think that maybe having a log_param <> read_param functionality could be useful. So in the first experiment run, dvclive would use log_param to "register" all potential parameters. If the user then adapts their code to use read_param, subsequent experiment runs could use those registered parameters.
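
A rough sketch of that roundtrip (both methods are proposals here, not existing dvclive API):

```python
import dvclive

# First run: "register" each param as it is chosen inside the code
dvclive.log_param("learning_rate", 0.001)  # proposed method

# Subsequent runs, after adapting the code: reuse the registered value
# (read_param is hypothetical, mirroring log_param)
learning_rate = dvclive.read_param("learning_rate")
```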

@pared
Contributor

pared commented Jul 20, 2021

I wonder how we could integrate this functionality with DVC. In our designated workflow, the training code reads the params, so that DVC can take control over params files. When we implement this, we will also have params, but they will be more of "run information" than control parameters, so in a way they will resemble metrics.
I guess we could make the dvclive dir contain params and metrics dirs, but that would mean that either we have to use dvc metrics diff to compare dvclive-sourced params, or implement some special behavior for dvclive outputs. Since DVC already has a special notion of a dvclive output, it might be possible to do.

@dtrifiro added this to DVC on Aug 31, 2022
@dtrifiro self-assigned this on Aug 31, 2022
@dtrifiro moved this to Backlog in DVC on Aug 31, 2022
@dtrifiro moved this from Backlog to In Progress in DVC on Aug 31, 2022
@dtrifiro moved this from In Progress to Review In Progress in DVC on Sep 20, 2022
@daavoo linked a pull request on Sep 26, 2022 that will close this issue
@daavoo closed this as completed on Sep 26, 2022
Repository owner moved this from Review In Progress to Done in DVC on Sep 26, 2022