Databricks workflow with SQL Warehouse Pro crashes with http error 503 with version 1.7.4 #570
Thanks for the report, @OlofL. Could you elaborate a bit on what kind of workload you're running? Is your model generating a regular table, streaming table, materialized view, or something else?
Of course. It is our standard daily build: it starts by creating table models and seeds, then 88 different snapshots, and then continues with lots of incremental models, a few table models, and finally a number of views. We have two workspaces; one uses Serverless, the other Pro SQL Warehouses. The problem only occurred in the Pro one. By switching the other workspace from Serverless to Pro, the problem occurred there as well.
No streaming tables, just plain batch transformations.
job.txt
The combination of things working when switching to serverless or 1.7.3, as well as the failures giving up after < 3 minutes, suggests to me that something has changed in our retry policy in databricks-sql-connector 3.0.1 (which we upgraded to with 1.7.4). Going to work with @susodapop to try to figure out what is different.
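For context, here is a minimal sketch of the mechanism behind this error message, assuming the connector's retry policy builds on urllib3's generic `Retry` (the class that raises `MaxRetryError` with `ResponseError('too many 503 error responses')` once the status budget is spent). The budget and backoff values below are illustrative, not databricks-sql-connector 3.0.1's actual defaults:

```python
from urllib3.util.retry import Retry

# Illustrative numbers only; not the connector's real configuration.
retry = Retry(total=5, status_forcelist=[503], backoff_factor=2, allowed_methods=None)

# A 503 response is considered retryable while any budget remains...
print(retry.is_retry("POST", 503))  # True

# ...but the exponential backoff schedule bounds the total time spent waiting,
# which would be consistent with a job giving up in under ~3 minutes while the
# warehouse is still starting.
waits = [retry.backoff_factor * 2 ** attempt for attempt in range(5)]
print(waits, "=", sum(waits), "seconds of backoff in total")
```

If a retry budget that small is exhausted faster than a cold SQL Warehouse can boot, the client gives up even though the warehouse would eventually come up.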
Great! Since we are able to recreate the problem in both our environments, I'm happy to help with any further testing if possible.
@OlofL if you can get your dbt artifacts to me from one of the failed runs (either file a ticket with your Databricks contact, or you can email them to me at [email protected]), that would be very helpful.
I can get it from our second environment, so it is in your mailbox now.
@OlofL do you see this issue at all with 1.7.3? I'm able to repro the failure this morning with 1.7.3 :/
Great that you can replicate it. We have only seen it in 1.7.4, not in 1.7.3.
We are also experiencing the same issue with 1.7.4, but not with 1.7.3. 1.7.4 works OK if the SQL Warehouse is already running, but if the workflow has to start the SQL Warehouse, that is when we encounter the 503 error. Adding retries to the task is another workaround for us at this point.
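The task-retry workaround mentioned above can be configured on the workflow task itself. A sketch of the relevant fields in a Jobs API 2.1 task definition (values are illustrative, and `task_key` is a placeholder):

```json
{
  "task_key": "dbt_daily_build",
  "max_retries": 2,
  "min_retry_interval_millis": 120000,
  "retry_on_timeout": false
}
```

With `min_retry_interval_millis` set, a failed first attempt waits long enough for the warehouse to finish starting before the task runs again.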
Even after upgrading to 1.7.5, I'm still facing the same bug as described in this issue. I get the exact same error message as above. Does 1.7.5 resolve the issue for you @OlofL @andrewpdillon ? My full package list:
@OlofL @andrewpdillon @henlue please let me know if the above (1.7.4testing branch) resolves the issue.
@OlofL @andrewpdillon @henlue any update? I'm looking for feedback that this branch resolves the issue before I release it, as this circumstance is very hard to repro internally.
I configured a new workflow with 1.7.4 to first verify that I still had the error, but I didn't get it this time. Is the error dependent on a specific version of dbt? I will try again.
No, whether you see the error or not depends on how quickly compute is available, hence why reproducing is so hard; it is a function of the cloud provider's compute availability.
Ok, that explains a lot! I will try to set up some jobs on a timer, but I suspect that the next bottleneck for Azure will be around the first of March; that is typically when Azure gets bogged down.
Yeah, for me, I have the best luck reproducing at 4:30 pm PST, and I have no idea why lol.
That is when we in Europe start all our major workloads, at 01:30 CET.
I have the same issue but with an older version of dbt: 1.4.9. In my case, I had a scheduled job that was working fine up to the 28th of November last year. Then it suddenly started failing the next day, without any changes from our side (development was on hold). Here's the log for the 28th execution:
And the 29th, when the error started showing up:
System info:
Hope this helps shed some more light on the issue.
## Changes

This adds a new dbt-sql template. This work requires the new WorkspaceFS support for dbt tasks. In this latest revision, I've hidden the new template from the list so we can merge it, iterate over it, and properly release the template at the right time.

Blockers:
- [x] WorkspaceFS support for dbt projects is in prod
- [x] Move dbt files into a subdirectory
- [ ] Wait until the next (>1.7.4) release of the dbt plugin which will have major improvements!
  - _Rather than wait, this template is hidden from the list of templates._
- [x] SQL extension is preconfigured based on extension settings (if possible)
- MV / streaming tables:
  - [x] Add to template
  - [x] Fix databricks/dbt-databricks#535 (to be released in 1.7.4)
  - [x] Merge databricks/dbt-databricks#338 (to be released in 1.7.4)
- [ ] Fix "too many 503 errors" issue (databricks/dbt-databricks#570, internal tracker: ES-1009215, ES-1014138)
- [x] Support ANSI mode in the template
- [ ] Streaming tables support is either ungated or the template provides instructions about signup
  - _Mitigation for now: this template is hidden from the list of templates._
- [x] Support non-workspace-admin deployment
- [x] Make sure `data_security_mode: SINGLE_USER` works on non-UC workspaces (it's required to be explicitly specified on UC workspaces with single-node clusters)
- [x] Support non-UC workspaces

## Tests

- [x] Unit tests
- [x] Manual testing
- [x] More manual testing
- [ ] Reviewer manual testing
  - _I'd like to do a small bug bash post-merging._
Describe the bug
A workflow dbt job terminates with the error `Max retries exceeded with url: ... (Caused by ResponseError('too many 503 error responses'))` just when it starts sending SQL commands to the cluster.
This occurs only with dbt-databricks version 1.7.4, not with version 1.7.3.
This occurs only with SQL Warehouse Pro, not with SQL Warehouse Serverless.
Steps To Reproduce
The problem occurs in a workflow in a Databricks workspace with the following settings:
Running on Azure, Databricks Premium, not Unity Catalog
Job cluster: single node, Standard_DS3_v2
SQL Warehouse: Pro, X-Small, Cluster count: Active 0, Min 1, Max 1, Channel: Current, Cost optimized
Git source: Azure DevOps
Library version setting: dbt-databricks>=1.0.0,<2.0.0
Start the workflow. After the job cluster has been created and the SQL Warehouse has been started, an error is shown in the log:
02:09:19 Running with dbt=1.7.5
02:09:20 Registered adapter: databricks=1.7.4
02:09:20 Unable to do partial parsing because saved manifest not found. Starting full parse.
02:09:38 Found 393 models, 88 snapshots, 1 analysis, 5 seeds, 1443 tests, 109 sources, 8 exposures, 0 metrics, 920 macros, 0 groups, 0 semantic models
02:09:38
02:12:11
02:12:11 Finished running in 0 hours 2 minutes and 32.34 seconds (152.34s).
02:12:11 Encountered an error:
Runtime Error
HTTPSConnectionPool(host='[removed]', port=443): Max retries exceeded with url: /sql/1.0/warehouses/[removed] (Caused by ResponseError('too many 503 error responses'))
Changing to a SQL Warehouse of type Serverless (X-Small, Cluster count: Active 0, Min 1, Max 1, Channel: Current) solves the problem.
Running an older version of the dbt-databricks library also solves the problem (dbt-databricks>=1.0.0,<1.7.3).
The problem appeared on Jan 25 when the version was updated from 1.7.3 -> 1.7.4.
I have not been able to reproduce the same problem when starting dbt interactively, only as a workflow, but that may be my mistake; this has been a stressful couple of days due to this.
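The version pin used as a workaround above can be sanity-checked with the third-party `packaging` library. A minimal sketch showing which adapter releases the pin admits:

```python
from packaging.specifiers import SpecifierSet
from packaging.version import Version

# The pin from the workaround above: keeps the adapter on pre-1.7.3 releases.
pin = SpecifierSet(">=1.0.0,<1.7.3")

for candidate in ("1.7.2", "1.7.3", "1.7.4"):
    print(candidate, "allowed" if Version(candidate) in pin else "excluded")
```

This is how pip resolves the library spec in the workflow's library settings, so the same check predicts which adapter version a new job cluster will install.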
Expected behavior
Log output:
02:08:27 Running with dbt=1.7.5
02:08:28 Registered adapter: databricks=1.7.3
02:08:28 Unable to do partial parsing because saved manifest not found. Starting full parse.
02:08:46 Found 393 models, 88 snapshots, 1 analysis, 5 seeds, 1443 tests, 109 sources, 8 exposures, 0 metrics, 919 macros, 0 groups, 0 semantic models
02:08:46
02:14:30 Concurrency: 12 threads (target='prod')
02:14:30
02:14:31 1 of 1835 START sql table model staging.rollup12helper ......................... [RUN]
02:14:31 2 of 1835 START sql table model staging.rollup24helper ......................... [RUN]
Screenshots and log output
N/A
System information
dbt 1.7.5
dbt-databricks 1.7.4
Configured in a dbt workflow, not much control over settings there.
Additional context
N/A