Describe the bug
The Databricks plugin does not check for all of the terminal states that the Databricks Jobs API can return. As a result, if the API returns a state the plugin does not recognise, the plugin assumes the job is still running.
https://github.com/flyteorg/flyte/blob/master/flyteplugins/go/tasks/plugins/webapi/databricks/plugin.go#L227
Here we only check whether the returned state is "TERMINATED"; if it is not, we assume the job is still running. However, the Databricks API defines several terminal states: TERMINATED, SKIPPED and INTERNAL_ERROR.
https://docs.databricks.com/en/workflows/jobs/jobs-2.0-api.html#runlifecyclestate
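Paraphrasing the linked check (the exact surrounding code may differ slightly), the handling is roughly:

```go
case http.StatusOK:
	// Only "TERMINATED" is treated as terminal; any other lifecycle state,
	// including SKIPPED and INTERNAL_ERROR, falls through and the job keeps
	// being reported as running.
	if lifeCycleState == "TERMINATED" {
		// ... resolve the final phase ...
	}
```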
When I ran a Databricks job through the plugin recently (3 days ago) I hit the INTERNAL_ERROR state. The cause was related to EC2 or AWS networking; either way, the Flyte Databricks plugin did not detect that the job had ended and kept reporting it as RUNNING.
The Slack thread describing this is here: https://flyte-org.slack.com/archives/CP2HDHKE1/p1696954554555229
Expected behavior
As a minimum we should capture all terminal states, something like this:
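```go
case http.StatusOK:
	// Treat every state the Jobs API documents as terminal, not just TERMINATED.
	if lifeCycleState == "TERMINATED" || lifeCycleState == "SKIPPED" || lifeCycleState == "INTERNAL_ERROR" {
		// ... map the run's result state to a terminal Flyte phase ...
	}
```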
Ideally we should be able to capture all states, not just the terminal ones, and show them in flyteconsole, i.e. states like PENDING, RUNNING and TERMINATING. This might not be possible if Flyte has a predefined set of states for its jobs, but either way we need better handling of Databricks job statuses.
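As a sketch of what fuller handling could look like (the phaseForState helper and the string phases below are illustrative, not existing plugin code; the real plugin would map to Flyte's own phase type):

```go
package databricks

import "fmt"

// phaseForState is an illustrative helper: it classifies every RunLifeCycleState
// documented by the Jobs 2.0 API, so an unrecognised state surfaces as an error
// instead of being silently reported as running.
func phaseForState(lifeCycleState, resultState string) (string, error) {
	switch lifeCycleState {
	case "PENDING", "RUNNING", "TERMINATING":
		return "RUNNING", nil
	case "TERMINATED":
		if resultState == "SUCCESS" {
			return "SUCCEEDED", nil
		}
		return "FAILED", nil
	case "SKIPPED", "INTERNAL_ERROR":
		return "FAILED", nil
	default:
		return "", fmt.Errorf("unrecognised Databricks life cycle state %q", lifeCycleState)
	}
}
```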
Additional context to reproduce
No response
Screenshots
No response
Are you sure this issue hasn't been raised already?
Yes
Have you read the Code of Conduct?
Yes