Provider Databricks add jobs create operator. #29790
Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
cc: @alexott ?
@kyle-winkelman what are the use cases for that operator?
My specific use case is that I want to run multiple tasks in the same job and on a single [...]
The other approach is to use the DatabricksRunNowOperator which has the limitation that you have to define your Databricks Job somewhere else and in some different manner (i.e. manually in Databricks UI, custom CI/CD pipeline, etc.). My team doesn't like having the definition of the job be separated from the use of it from Airflow. In my opinion the DatabricksRunNowOperator with just a single [...]
So to sum up, it is useful to be paired with the DatabricksRunNowOperator to define a job and run it in the same DAG.
I'll discuss this further with the Databricks Workflows team
tasks: list[object] | None = None,
job_clusters: list[object] | None = None,
email_notifications: object | None = None,
webhook_notifications: object | None = None,
timeout_seconds: int | None = None,
schedule: dict[str, str] | None = None,
max_concurrent_runs: int | None = None,
git_source: dict[str, str] | None = None,
access_control_list: list[dict[str, str]] | None = None,
For new operators I would really like to use real data structures (for example, data classes) rather than plain objects: plain objects don't give users auto-completion, etc., and it's easy to make a mistake in an untyped JSON definition.
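To make the suggestion concrete, here is a minimal sketch of what typed job settings could look like. The class names (`NotebookTask`, `JobTask`, `JobSettings`) and the helper `_drop_none` are hypothetical and not part of this PR or of any Databricks library; only a few fields are modelled, and the field names are assumed to follow the Jobs API payload.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass, field


@dataclass
class NotebookTask:
    # Hypothetical type: models only the notebook path.
    notebook_path: str


@dataclass
class JobTask:
    # Hypothetical type: one task inside a job.
    task_key: str
    notebook_task: NotebookTask
    job_cluster_key: str | None = None


@dataclass
class JobSettings:
    # Hypothetical type: a small subset of the job definition.
    name: str
    tasks: list[JobTask] = field(default_factory=list)
    timeout_seconds: int | None = None
    max_concurrent_runs: int | None = None

    def to_json(self) -> dict:
        """Convert to a plain dict, dropping fields that were left unset."""
        return _drop_none(asdict(self))


def _drop_none(value):
    """Recursively remove dict entries whose value is None."""
    if isinstance(value, dict):
        return {k: _drop_none(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [_drop_none(v) for v in value]
    return value


settings = JobSettings(
    name="example-job",
    tasks=[JobTask(task_key="ingest", notebook_task=NotebookTask("/Shared/ingest"))],
    max_concurrent_runs=1,
)
payload = settings.to_json()
```

With this shape, a misspelled field name fails at construction time instead of being sent to the API, and editors can auto-complete the fields. The cost is that every structure has to be written or generated and kept in sync with the API.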
Do you have an example of a different operator that follows this pattern? It would be helpful to see an example.
Any thoughts on a tool to generate such data structures from the Databricks OpenAPI Spec? There are a lot of data structures that would need to be created and I don't want to do so manually.
My initial thought was to rely on the Databricks Python API, but it doesn't have these data structures or validations either.
I don’t have an existing example; these were just thoughts for new operators. A new Python API will be available soon that will provide access to the latest APIs. I need to ask the dev team when it’s coming.
Do you think this new operator should wait for the new Python API that you expect to be available soon?
I’m waiting for an answer from product management. Maybe we won’t need this operator…
Have you heard anything from the Databricks product management @alexott?
no decision yet.
@alexott let's please get to a decision in the upcoming days (before the next wave of providers). If not, I will accept the PR as is.
If Databricks later decides against these operators and presents arguments for why we shouldn't have them, we can always remove them in a major release.
    include_prior_dates=True,
)
if self.job_id:
    self._hook.reset(self.job_id, self.json)
Instead of [...] I would recommend getting the current job definition, comparing it with the new definition, and resetting only if the definition has changed (which should happen relatively rarely).
Also, you need to handle the following edge cases:
- The job is deleted via the UI: the reset will fail because the job ID doesn't exist.
- The XCom information is lost: we'll then create a duplicate of the job.
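The compare-then-reset flow and the deleted-job case can be sketched as below. This is only an illustration under stated assumptions: `FakeJobsClient`, `JobNotFound`, and `create_or_reset` are hypothetical names, and the in-memory client stands in for the real hook, whose methods and error types may differ.

```python
from __future__ import annotations


class JobNotFound(Exception):
    """Raised by the hypothetical client when a job ID no longer exists."""


class FakeJobsClient:
    """In-memory stand-in for a Jobs API client (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self.jobs: dict[int, dict] = {}
        self._next_id = 1
        self.reset_calls = 0

    def create(self, settings: dict) -> int:
        job_id = self._next_id
        self._next_id += 1
        self.jobs[job_id] = dict(settings)
        return job_id

    def get(self, job_id: int) -> dict:
        if job_id not in self.jobs:
            raise JobNotFound(job_id)
        return self.jobs[job_id]

    def reset(self, job_id: int, settings: dict) -> None:
        if job_id not in self.jobs:
            raise JobNotFound(job_id)
        self.jobs[job_id] = dict(settings)
        self.reset_calls += 1


def create_or_reset(client, job_id: int | None, settings: dict) -> int:
    """Return the ID of a job whose definition matches `settings`."""
    if job_id is not None:
        try:
            current = client.get(job_id)
        except JobNotFound:
            # The job was deleted (for example via the UI): recreate it below.
            job_id = None
        else:
            # Reset only when the stored definition differs from the new one.
            if current != settings:
                client.reset(job_id, settings)
            return job_id
    return client.create(settings)


client = FakeJobsClient()
first = create_or_reset(client, None, {"name": "etl"})
same = create_or_reset(client, first, {"name": "etl"})
resets_after_unchanged = client.reset_calls
changed = create_or_reset(client, first, {"name": "etl", "max_concurrent_runs": 1})
del client.jobs[first]  # simulate deletion via the UI
recreated = create_or_reset(client, first, {"name": "etl"})
```

A lost job ID (the XCom case) is not covered by this sketch; it would need a lookup by some other key, such as the job name. A real comparison would probably also have to ignore fields that the server fills in on its own.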
@alexott convinced me that it needs changes.
@potiuk Unfortunately no news. The decision is taking longer than planned. If we can deprecate it in later versions, let's merge it.
OK. Cool. I am a big fan of having tactical solutions that solve part of the problem, or a problem for a smaller group of people, and releasing them early, as long as they do not prevent strategic long-term solutions that need a longer debate and far more work to be implemented later. This one looks like one of those. I re-reviewed it and it looks cool. @kyle-winkelman -> can you please rebase and fix the static check problem, then I am happy to merge it (@eladkal - WDYT?).
@potiuk @kyle-winkelman This seems like a useful addition from the Databricks perspective. Thanks for contributing, Kyle! 🙏 I do think it would be good to look at the XCOM issue that Alex called out. Wouldn't it be better to just use the API to find an existing job for the cases that Alex called out where the current approach doesn't work? To avoid creating duplicate jobs or having the operator fail? The |
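One way to read the "find an existing job" suggestion is a lookup by job name over the result of a job listing call. The sketch below assumes each listed job is a dict with a `job_id` and a nested `settings.name`; that shape, the function name `find_job_id_by_name`, and the sample data are assumptions for illustration, not something this PR defines.

```python
from __future__ import annotations


def find_job_id_by_name(jobs: list[dict], name: str) -> int | None:
    """Return the ID of the single job called `name`, or None if there is none."""
    matches = [
        job["job_id"]
        for job in jobs
        if job.get("settings", {}).get("name") == name
    ]
    if len(matches) > 1:
        # Names may not be unique, so refuse to guess which job is meant.
        raise ValueError(f"Multiple jobs named {name!r}: {matches}")
    return matches[0] if matches else None


# Hypothetical listing result used only to exercise the function.
listed = [
    {"job_id": 11, "settings": {"name": "etl"}},
    {"job_id": 12, "settings": {"name": "reporting"}},
]
existing = find_job_id_by_name(listed, "etl")
missing = find_job_id_by_name(listed, "does-not-exist")
```

Looking the job up by name removes the dependency on XCom for remembering the job ID, at the price of requiring job names to be unique within the workspace.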
hey @kyle-winkelman do you have bandwidth to move forward with this PR?
@potiuk is it okay if someone else forks Kyle's repo and fixes the static check, so as not to lose @kyle-winkelman's credit/commits, and also addresses the previous comments made by @alexott? I am also okay to wait a bit for Kyle to respond.
Fine for me.
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.
Add the DatabricksJobsCreateOperator for use cases where the DatabricksSubmitRunOperator is insufficient.
closes: #29733