
Cancel python model job when dbt exits #690

Closed

Conversation


@gaoshihang gaoshihang commented May 29, 2024

Resolves #684

Description

Checklist

  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt-databricks next" section.

@gaoshihang
Author

Hi @benc-db, could you please help review this PR? The code is not ready, but I want you to look at it first and see if the approach is fine. It relates to issue #684.

I think we can utilize the Databricks workspace: we create a job_run_ids directory, and each python model creates a file named after its run_id in that directory.

When dbt is canceled, we read this job_run_ids directory, cancel every run_id in it, and then delete the files.
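
The proposed design can be sketched roughly like this. Everything here is illustrative: the function names and the `cancel_run` callback stand in for the adapter's actual Databricks "cancel run" API call, and a local directory stands in for the workspace directory.

```python
# Hypothetical sketch of the proposal: each python model run writes a marker
# file named after its Databricks run_id; on dbt cancellation we list the
# directory, cancel every run, and delete the markers.
import os
from typing import Callable, List


def register_run(run_id_dir: str, run_id: str) -> None:
    # One empty marker file per in-flight python model job.
    os.makedirs(run_id_dir, exist_ok=True)
    open(os.path.join(run_id_dir, run_id), "w").close()


def cancel_all(run_id_dir: str, cancel_run: Callable[[str], None]) -> List[str]:
    # Cancel every registered run, then remove its marker file.
    cancelled = []
    for run_id in os.listdir(run_id_dir):
        cancel_run(run_id)
        os.remove(os.path.join(run_id_dir, run_id))
        cancelled.append(run_id)
    return cancelled
```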

@gaoshihang
Author

And I have tested it: the job can be canceled, and fail-fast works correctly.

for run_id in run_ids:
    self._cancel_run_id(run_id_dir, run_id)

return super().cancel_open()
Collaborator


Below you raise an exception on a non-200 response, but that will interrupt cancelling the other operations. Better to log a warning on non-200, I think.
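
The reviewer's warn-and-continue suggestion could look like this sketch. The `post_cancel` callable and status codes are illustrative, not the adapter's real API client.

```python
# Sketch: on a non-200 cancel response, log a warning and keep going, so one
# failed cancel does not abort cancellation of the remaining runs.
import logging
from typing import Callable, Iterable

logger = logging.getLogger("dbt.databricks")


def cancel_runs(run_ids: Iterable[str], post_cancel: Callable[[str], int]) -> int:
    cancelled = 0
    for run_id in run_ids:
        status = post_cancel(run_id)
        if status != 200:
            # Warn instead of raise: the remaining runs still get cancelled.
            logger.warning(f"Cancel of run {run_id} returned status {status}")
            continue
        cancelled += 1
    return cancelled
```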


    return super().cancel_open()

def _cancel_run_id(self, run_id_dir: str, run_id: str) -> None:
Collaborator


Since neither of these methods relies on anything in self, I would prefer them as static functions in python_submissions.py, so they are closer to the code that they are cleaning up.
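
A sketch of what that refactor might look like, with hypothetical function bodies: self-free helpers become module-level functions in python_submissions.py, next to the code that creates the runs.

```python
# Illustrative module-level helpers (names and bodies are assumptions, not the
# PR's actual code): reading and cleaning up run_id marker files.
import os
from typing import List


def read_run_ids(run_id_dir: str) -> List[str]:
    # Each file in the directory is named after one in-flight run.
    return os.listdir(run_id_dir) if os.path.isdir(run_id_dir) else []


def remove_run_id_file(run_id_dir: str, run_id: str) -> None:
    path = os.path.join(run_id_dir, run_id)
    if os.path.exists(path):
        os.remove(path)
```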

@benc-db
Collaborator

benc-db commented May 30, 2024

Hi @benc-db, could you please help review this PR? The code is not ready, but I want you to look at it first and see if the approach is fine. It relates to issue #684.

I think we can utilize the Databricks workspace: we create a job_run_ids directory, and each python model creates a file named after its run_id in that directory.

When dbt is canceled, we read this job_run_ids directory, cancel every run_id in it, and then delete the files.

If we can locate the target folder, I think I would prefer writing there, so that we don't rely on an API operation to store and retrieve. Also, I think we need to lock around writing the file to ensure clean operation when kicking off multiple python models concurrently.
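
The locking concern above can be sketched as follows. The function name and file layout are assumptions; the point is that concurrent python model threads serialize their writes.

```python
# Hedged sketch: when multiple python models start concurrently, each appends
# its run_id to a shared file under a lock, so writes don't interleave.
import threading

_run_id_lock = threading.Lock()


def record_run_id(path: str, run_id: str) -> None:
    # Serialize appends so concurrent model threads produce clean lines.
    with _run_id_lock:
        with open(path, "a") as f:
            f.write(run_id + "\n")
```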

@benc-db
Collaborator

benc-db commented May 30, 2024

@mikealfare we're trying to figure out how to cancel python jobs as part of cleanup, similar to what is done for SQL queries when the user ctrl-Cs. Is there a better way to communicate run_ids from the python job helper to the connection manager? We were wondering if maybe there was some global state that would help the python job helper figure out the target directory?

@benc-db
Collaborator

benc-db commented May 30, 2024

@jtcohen6 as well

@gaoshihang
Author

Hi @benc-db, could you please help review this PR? The code is not ready, but I want you to look at it first and see if the approach is fine. It relates to issue #684.
I think we can utilize the Databricks workspace: we create a job_run_ids directory, and each python model creates a file named after its run_id in that directory.
When dbt is canceled, we read this job_run_ids directory, cancel every run_id in it, and then delete the files.

If we can locate the target folder, I think I would prefer writing there, so that we don't rely on an API operation to store and retrieve. Also, I think we need to lock around writing the file to ensure clean operation when kicking off multiple python models concurrently.

Hi @benc-db, many thanks for your help. I didn't find a way to get the target path in python_submissions.py; I'll try to find one today.

@benc-db
Collaborator

benc-db commented May 30, 2024

Hi @benc-db, could you please help review this PR? The code is not ready, but I want you to look at it first and see if the approach is fine. It relates to issue #684.
I think we can utilize the Databricks workspace: we create a job_run_ids directory, and each python model creates a file named after its run_id in that directory.
When dbt is canceled, we read this job_run_ids directory, cancel every run_id in it, and then delete the files.

If we can locate the target folder, I think I would prefer writing there, so that we don't rely on an API operation to store and retrieve. Also, I think we need to lock around writing the file to ensure clean operation when kicking off multiple python models concurrently.

Hi @benc-db, many thanks for your help. I didn't find a way to get the target path in python_submissions.py; I'll try to find one today.

I'm reaching out to dbt Labs folks to see if there is a better way. In particular, @mikealfare has worked on the dbt-spark adapter, so if we figure it out, it might be good for that library too.

@gaoshihang
Author

gaoshihang commented May 30, 2024

Hi @benc-db, I'm thinking: can we use a class static variable to share run_ids?
[code screenshots omitted]

@benc-db
Collaborator

benc-db commented May 30, 2024

Hi @benc-db, I'm thinking: can we use a class static variable to share run_ids?

Good point. I'm generally so anti-global-state that I didn't even think of it :P. We still need to protect it from concurrency issues, but global state is going to be our best bet until we get support from dbt-core, and it's better to store in memory than in cloud files.

@gaoshihang
Author

Hi @benc-db, I'm thinking: can we use a class static variable to share run_ids?

Good point. I'm generally so anti-global-state that I didn't even think of it :P. We still need to protect it from concurrency issues, but global state is going to be our best bet until we get support from dbt-core, and it's better to store in memory than in cloud files.

Yes... I don't want to use it either... but it seems like, if we don't want to change dbt-core, it's the only way we can share state between the two classes. Let me write some code this way!
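
The agreed approach could be sketched like this. The class and attribute names follow the PR (`BaseDatabricksHelper.run_ids`), but the tracking methods and the lock are assumptions of mine, added per benc-db's concurrency note.

```python
# Sketch: a class-level set shared by all helper instances, guarded by a
# class-level lock so concurrent python model threads register and deregister
# run_ids safely.
import threading
from typing import Set


class BaseDatabricksHelper:
    run_ids: Set[str] = set()               # shared across all instances
    _lock: threading.Lock = threading.Lock()

    @classmethod
    def track_run_id(cls, run_id: str) -> None:
        with cls._lock:
            cls.run_ids.add(run_id)

    @classmethod
    def untrack_run_id(cls, run_id: str) -> None:
        with cls._lock:
            cls.run_ids.discard(run_id)
```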

@gaoshihang
Author

Hi @benc-db, I revised the code, using a global variable to store all the run ids and then canceling them in the connection manager. Please help review this, thank you very much!

@@ -475,6 +476,17 @@ class DatabricksConnectionManager(SparkConnectionManager):
    TYPE: str = "databricks"
    credentials_provider: Optional[TCredentialProvider] = None

    def cancel_open(self) -> List[str]:
        from dbt.adapters.databricks.python_submissions import BaseDatabricksHelper
Collaborator


Import at the top, please. We only import in place like this if the thing we're importing is too heavy to load at start-up.

Author


done!

from dbt.adapters.databricks.python_submissions import BaseDatabricksHelper

for run_id in BaseDatabricksHelper.run_ids:
    logger.info(f"cancel run id {run_id}")
Collaborator


I think this can be debug-level, and we should mention that it's a python model job.

Author


done!

@@ -15,9 +15,11 @@

class token_auth(CredentialsProvider):
    _token: str
    _host: str
Collaborator


Why store this on the token? It's already on the DatabricksCredentials.

Author


It seems like I can't get DatabricksCredentials in DatabricksConnectionManager.

I can use self.credentials_provider in DatabricksConnectionManager, but there is no host in credentials_provider, so I put a host in the token_auth class.

Could you give me a pointer on how to get DatabricksCredentials in DatabricksConnectionManager?

Collaborator


The BaseDatabricksHelper has a copy of DatabricksCredentials.

Collaborator


Hmm, but that's an instance... let me think.

Collaborator


I'm going to pull down a copy of this PR and see if I can figure it out.

Author


Yes... I think I can't get the instance in DatabricksConnectionManager...

Thank you very much!

Collaborator


Oh, did you already fix this?

Author


No, I couldn't find a way, so I still use the approach of putting the host in token_auth, so that the host can be retrieved from DatabricksConnectionManager.credentials_provider.

Collaborator

@benc-db benc-db left a comment


Mostly good, just some minor comments to clean up.

@mikealfare
Contributor

@mikealfare we're trying to figure out how to cancel python jobs as part of cleanup, similar to what is done for SQL queries when the user ctrl-Cs. Is there a better way to communicate run_ids from the python job helper to the connection manager? We were wondering if maybe there was some global state that would help the python job helper figure out the target directory?

It looks like we might be trying to do something similar here for dbt-bigquery. I haven't read through either PR in detail to know if this solves your problem, but I figured I'd link it here in the event it's helpful.

def cancel_open(self) -> List[str]:
    for run_id in BaseDatabricksHelper.run_ids:
        logger.debug(f"Cancel python model job: {run_id}")
        BaseDatabricksHelper.cancel_run_id(
            run_id,
            self.credentials_provider.as_dict()["token"],
            self.credentials_provider.as_dict()["host"],
        )
Collaborator


Ah, this is where you need to retrieve it, and here you don't have an instance... maybe we can use the singleton pattern?

Collaborator


Give me an hour to take a crack at refactoring this; I have an idea :)

Author


A singleton may be a way; let me try some code.
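
For illustration only (this is not the refactoring benc-db eventually landed), a minimal singleton holder could let the job helper register credentials at submission time and the connection manager read them back at cancellation time. The class name and fields are hypothetical.

```python
# Illustrative singleton sketch: one process-wide holder for host/token,
# constructed lazily with double-checked locking.
import threading
from typing import Optional


class CredentialsHolder:
    _instance: Optional["CredentialsHolder"] = None
    _lock = threading.Lock()
    host: Optional[str] = None
    token: Optional[str] = None

    def __new__(cls) -> "CredentialsHolder":
        # Double-checked locking: only the first caller constructs the instance.
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    cls._instance = super().__new__(cls)
        return cls._instance
```

Every call to `CredentialsHolder()` returns the same object, so state set by one component is visible to the other.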

Author


Yeah! Thank you very much!

@benc-db
Collaborator

benc-db commented Jun 3, 2024

This is annoying, but it is actually teaching me a lot about our python model support, lol. It's taking me longer than I expected because I'm trying to get it to work with all credential types and all execution formats (i.e. commands vs. notebooks).

@gaoshihang
Author

This is annoying, but it is actually teaching me a lot about our python model support, lol. It's taking me longer than I expected because I'm trying to get it to work with all credential types and all execution formats (i.e. commands vs. notebooks).

No rush! Thanks for your support. Please let me know when you're done, and I'll modify the code in this PR.

@gaoshihang
Author

Hi @benc-db, I'm reaching out to see if there is anything I can help with!

@benc-db
Collaborator

benc-db commented Jun 4, 2024

Hi @benc-db, I'm reaching out to see if there is anything I can help with!

I'll have a new PR up shortly... I got stalled because weather knocked out my internet yesterday. After I put up my PR, if you could download it and validate that it works for your scenario, that would be great.

@benc-db benc-db mentioned this pull request Jun 4, 2024
@benc-db
Collaborator

benc-db commented Jun 4, 2024

@gaoshihang closing in favor of #693. Please take a look and verify it works for your use case.

@benc-db benc-db closed this Jun 4, 2024
@gaoshihang
Author

Hi @benc-db, thank you very much! I will do that later and let you know!

Development

Successfully merging this pull request may close these issues.

Python model doesn't have cancel method
3 participants