
Kedro-MLflow on AWS batch causes every node to be logged as a separate run #432

Closed
hugocool opened this issue Jul 11, 2023 · 6 comments

@hugocool

Description

When running Kedro pipelines on AWS Batch with kedro-mlflow, every node is logged as a separate run. This is because the pipeline is executed on Batch by running each node in a Docker container with a separate `docker run` command, i.e. `kedro run --node=...`.

Context

This is undesirable: these are not separate runs, just individual nodes, and it quickly pollutes your mlflow tracking server.
Therefore, each `kedro run` command issued to Batch should be made aware of the already active run the node is part of.

Possible Implementation

While the changes to the batch runner should be implemented in the deployment pattern, kedro-mlflow should allow one to pass an `mlflow_run_id` CLI kwarg that sets the run_id.
I'm currently implementing a solution using the config loader, a custom cloud runner, and changes to the batch runner CLI.
I'm curious whether there is a better or more minimal alternative.

Possible Alternatives

Setting an environment variable?
Overriding the run_id with the git commit? (Although this is difficult on Batch, since the container would have to be made aware of the git commit.)

@Galileo-Galilei
Owner

Galileo-Galilei commented Jul 16, 2023

Hi @hugocool, this is a common feature request, and it is already partially possible.

First, I want to stress that the fact that on AWS Batch each node is logged in a separate run is a feature, not a bug 😄 AWS Batch nodes are orchestrator nodes, and they don't have the same purpose as Kedro nodes. You can find a very similar discussion about kedro-mlflow support for Airflow in #44, which explains that orchestrator nodes should be mapped to Kedro pipelines rather than to Kedro nodes. This has been discussed with the Kedro team a couple of times too.

That said, your request is valid: you may want to propagate an mlflow run id through different orchestrator nodes. Some good news:

1. If an mlflow run is already active, kedro-mlflow uses it instead of starting a new one, so nothing prevents you from starting an mlflow run "manually", e.g. with `mlflow.start_run(run_id=YOUR_RUN_ID)`, before running the node. I think mlflow used to let you set an MLFLOW_RUN_ID environment variable, but it was not encouraged, so I am not sure what the current recommendation is. The drawback is that you need to set up the entire mlflow configuration manually (including the MLFLOW_TRACKING_URI, the MLFLOW_REGISTRY_URI...), because kedro-mlflow will ignore its configuration file.
2. kedro-mlflow has a `mlflow.tracking.run.id` key in its configuration. If you override the configuration file, or just this key (e.g. with the OmegaConfigLoader you can use a custom resolver to read an environment variable, see the sketch below), this will work out of the box.
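
For option 2, a rough sketch of what the custom resolver could look like, assuming mlflow.yml is resolved through the project's OmegaConfigLoader (the resolver name `env_var` is illustrative, not an existing kedro or kedro-mlflow API):

```python
# src/<project>/settings.py (sketch)
import os

from omegaconf import OmegaConf

# Register a resolver that the OmegaConfigLoader can use when parsing mlflow.yml, e.g.:
#
#   tracking:
#     run:
#       id: ${env_var:MLFLOW_RUN_ID,null}
#
if not OmegaConf.has_resolver("env_var"):
    OmegaConf.register_new_resolver(
        "env_var", lambda name, default=None: os.environ.get(name, default)
    )
```

The orchestrating process then only has to export MLFLOW_RUN_ID in the container, and the run id flows into `mlflow.tracking.run.id` without any extra CLI flag.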

Both solutions are valid and quite easy to set up. You are not the first one who wants to add configuration overriding at runtime through CLI args (see #395), but I am quite reluctant to add extra API when I think kedro will enable it natively with the OmegaConfigLoader and runtime parameters (see kedro-org/kedro#2504, kedro-org/kedro#2175), because this generates much more boilerplate code and responsibility on my side and I have a hard time supporting it, so I'd prefer it to be on the framework side.

@hugocool
Author

Would it not make sense to use a kedro run parameter override?
kedro run --params=<param_key1>:<value1>,<param_key2>:<value2>

So basically I need to override the kedro run command sent to each Docker container on Batch to be `kedro run --params=mlflow.tracking.run.id:main_run_id`, where `main_run_id` is the mlflow run id of the main process that manages the execution of nodes on Batch.

One could probably make this work through the TemplatedConfigLoader and the use of a global.
Although, after some digging around, I found this working for Kedro 0.17; it could be a different story with 0.18 and the OmegaConfigLoader.

Also, I don't know off the top of my head how to push this run id to kedro-mlflow.
I could do the following in the mlflow.yml:

```yaml
  run:
    id: "${main_run_id|None}" # if `id` is None, a new run will be created
    name: null # if `name` is None, pipeline name will be used for the run name
    nested: True  # if `nested` is False, you won't be able to launch sub-runs inside your nodes
```

And then try to pass it as a global, but I can't do extra_params with 0.18.

Do you have an example of how to override the mlflow.tracking.run.id?

I would love to contribute a full working solution, and incorporate it into a kedro-aws extension that is compatible with kedro-mlflow!

@marrrcin

marrrcin commented Jul 21, 2023

If AWS Batch injects some unique ID environment variable into every container (like a run ID, but specific to the AWS Batch service itself), you can follow the same idea we have for kedro-sagemaker. We first add a node to the pipeline to "start the mlflow run", which adds an mlflow tag with this unique identifier (PIPELINE_EXECUTION_ARN in the SageMaker case; I guess that for AWS Batch it would be one of the variables from https://docs.aws.amazon.com/batch/latest/userguide/job_env_vars.html), as shown here: https://github.com/getindata/kedro-sagemaker/blob/dbd78fd6c1781cc9e8cf046e14b3ab96faf63719/kedro_sagemaker/cli.py#L380. The subsequent nodes then look up the MLflow run ID using the MLflow SDK, as shown here: https://github.com/getindata/kedro-sagemaker/blob/dbd78fd6c1781cc9e8cf046e14b3ab96faf63719/kedro_sagemaker/cli_functions.py#L104, and set it in the MLFLOW_RUN_ID environment variable in the container before Kedro starts ;) We have this programmed in Docker entrypoints due to SageMaker limitations, but you can do the same with Kedro hooks.
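
A minimal sketch of the lookup half of that idea, assuming the orchestrating process has already tagged the parent run with the Batch identifier (the tag name `batch_job_id` and the experiment handling are illustrative, not the actual kedro-sagemaker code):

```python
import os

from mlflow.tracking import MlflowClient


def attach_to_parent_run(job_id: str, experiment_name: str) -> None:
    """Find the run tagged with the Batch job id and expose it before Kedro starts.

    Assumes MLFLOW_TRACKING_URI is already set and that the orchestrating process
    created a run tagged with batch_job_id=<job_id>.
    """
    client = MlflowClient()
    experiment = client.get_experiment_by_name(experiment_name)
    runs = client.search_runs(
        experiment_ids=[experiment.experiment_id],
        filter_string=f"tags.batch_job_id = '{job_id}'",
        max_results=1,
    )
    if not runs:
        raise RuntimeError(f"no mlflow run tagged with batch_job_id={job_id}")
    # mlflow (and therefore kedro-mlflow) resumes this run instead of creating a new one
    os.environ["MLFLOW_RUN_ID"] = runs[0].info.run_id
```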

@Galileo-Galilei
Owner

Yes, @marrrcin's suggestion is likely the best way to do it: as explained, you need to start mlflow yourself in the container (e.g. by manually setting the MLFLOW_RUN_ID environment variable) before the kedro run starts, so that kedro-mlflow will use it as the default configuration instead of starting a new run. The drawback is that you have to set up all the other environment variables too (including MLFLOW_TRACKING_URI, MLFLOW_REGISTRY_URI).
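
For illustration, a minimal container-entrypoint sketch of that setup; the PARENT_MLFLOW_RUN_ID variable and the tracking URI are placeholders, not an existing kedro-mlflow API:

```python
import os
import subprocess
import sys

# Assumed to be injected by the orchestrating process (illustrative name).
parent_run_id = os.environ["PARENT_MLFLOW_RUN_ID"]

# mlflow resumes this run instead of creating a new one, so kedro-mlflow attaches to it,
# but the rest of the mlflow configuration has to be provided explicitly as well.
os.environ["MLFLOW_RUN_ID"] = parent_run_id
os.environ.setdefault("MLFLOW_TRACKING_URI", "http://my-tracking-server:5000")

# Run only the node assigned to this Batch container, e.g. `entrypoint.py <node_name>`.
subprocess.run(["kedro", "run", "--node", sys.argv[1]], check=True)
```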

@hugocool
Author

Thanks @marrrcin for the suggestion, I had not thought of that!

So the main difference with your suggestion would be the way the mlflow run id is communicated between the orchestrator and the AWS Batch containers.

In one approach, it is communicated through the `docker run` command for the container running in AWS Batch: the Kedro runner (AWSBatchRunner) could set this through the `kedro run --params=...` command. This would be passed as a command override and could then be picked up by kedro-mlflow and set as the run id.

The approach mentioned by @marrrcin, if I understand correctly, leverages an identifier issued by AWS Batch that is shared across all the containers resulting from a single run. This could be the Job Definition ID, the way I am currently implementing it, since for every kedro run I create a job definition to set the kedro run command.
One could tag the mlflow run with this job definition id and look it up before each container run in AWS Batch. This approach does not need an extra node to run, since the job definition is created before any nodes are running in Batch (unless you count the nodes running locally to orchestrate the Batch jobs).
The downsides to this approach are that we now also need to set the other mlflow configuration variables, as mentioned by @Galileo-Galilei, that there is an extra API call to mlflow before running each node in Batch, and that this mechanism does not seem very transparent, nor does it hook into established methods for modifying variables before a run. (At least using a Kedro hook does, but looking up and setting env vars does not.)

One other thing to take into account is that, with the recent addition of support for Prefect 2.0, there is now also the possibility of using the Prefect AWS Batch job. Since I want to migrate to a single open-source orchestrator for as many projects as possible, I would like the method to work for Prefect as well.
We discussed some of the hiccups in the Slack channel already, but we could open a separate issue for that of course.

@Galileo-Galilei
Owner

I am closing the issue in favor of #395. I hope we can make it work after the 0.19 release with OmegaConfigLoader resolvers, but it still needs some work and design.
