
Databricks workflow with SQL Warehouse Pro crashes with HTTP error 503 with version 1.7.4 #570

Closed
OlofL opened this issue Jan 26, 2024 · 21 comments · Fixed by #578
Labels
bug Something isn't working

Comments

@OlofL

OlofL commented Jan 26, 2024

Describe the bug

A dbt job in a workflow terminates with the error Max retries exceeded with url: ... (Caused by ResponseError('too many 503 error responses')) just as it starts sending SQL commands to the cluster.
This occurs only in connector version 1.7.4, not in version 1.7.3.
This occurs only with a Pro SQL Warehouse, not with a Serverless SQL Warehouse.

Steps To Reproduce

The problem occurs in a workflow in a Databricks workspace with the following settings:
Running on Azure, Databricks Premium, no Unity Catalog
Job cluster: single node Standard_DS3_v2
Work cluster: SQL Warehouse Pro X-Small, Cluster count: Active 0 Min 1 Max 1, Channel: Current, Cost optimized
Git source: Azure DevOps
Library version setting: dbt-databricks>=1.0.0,<2.0.0

Start the workflow. After the job cluster has been created and the SQL Warehouse has been started, an error is shown in the log:

  • dbt build --profiles-dir ../misc/misc/ -t prod -f
    02:09:19 Running with dbt=1.7.5
    02:09:20 Registered adapter: databricks=1.7.4
    02:09:20 Unable to do partial parsing because saved manifest not found. Starting full parse.
    02:09:38 Found 393 models, 88 snapshots, 1 analysis, 5 seeds, 1443 tests, 109 sources, 8 exposures, 0 metrics, 920 macros, 0 groups, 0 semantic models
    02:09:38
    02:12:11
    02:12:11 Finished running in 0 hours 2 minutes and 32.34 seconds (152.34s).
    02:12:11 Encountered an error:
    Runtime Error
    HTTPSConnectionPool(host='[removed]', port=443): Max retries exceeded with url: /sql/1.0/warehouses/[removed] (Caused by ResponseError('too many 503 error responses'))

Changing to a SQL Warehouse of type Serverless, X-Small, Cluster count: Active 0 Min 1 Max 1, Current solves the problem.
Running an older version of the dbt-databricks library (dbt-databricks>=1.0.0,<1.7.3) also solves the problem.
The problem appeared on Jan 25 when the version was updated from 1.7.3 to 1.7.4.
I have not been able to reproduce the problem when running dbt interactively, only as a workflow, but that may be my mistake; these have been a stressful couple of days because of this.
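For anyone who wants to probe this outside of dbt, here is a minimal sketch (not dbt's own code) that talks to the warehouse directly with databricks-sql-connector, the library dbt-databricks uses under the hood; hostname, HTTP path, and token are placeholders:

    # Minimal sketch: connect to a *stopped* Pro warehouse and see whether
    # the retry budget runs out before the warehouse finishes starting.
    # All connection values below are placeholders.
    from databricks import sql

    with sql.connect(
        server_hostname="adb-xxxxxxx.x.azuredatabricks.net",  # placeholder
        http_path="/sql/1.0/warehouses/xxxxxxxx",             # placeholder
        access_token="dapi-REDACTED",                         # placeholder
    ) as conn:
        with conn.cursor() as cursor:
            cursor.execute("SELECT 1")
            print(cursor.fetchone())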

Expected behavior

Log output:
02:08:27 Running with dbt=1.7.5
02:08:28 Registered adapter: databricks=1.7.3
02:08:28 Unable to do partial parsing because saved manifest not found. Starting full parse.
02:08:46 Found 393 models, 88 snapshots, 1 analysis, 5 seeds, 1443 tests, 109 sources, 8 exposures, 0 metrics, 919 macros, 0 groups, 0 semantic models
02:08:46
02:14:30 Concurrency: 12 threads (target='prod')
02:14:30
02:14:31 1 of 1835 START sql table model staging.rollup12helper ......................... [RUN]
02:14:31 2 of 1835 START sql table model staging.rollup24helper ......................... [RUN]

Screenshots and log output

N/A

System information

dbt 1.7.5
dbt-databricks 1.7.4

Configured in a Databricks workflow (dbt task), so not much control over settings there.

Additional context

N/A

OlofL added the bug (Something isn't working) label Jan 26, 2024
@lennartkats-db
Contributor

Thanks for the report, @OlofL. Could you elaborate a bit on what kind of workload you're running? Is your model generating a regular table, streaming table, materialized view, or something else?

@OlofL
Author

OlofL commented Jan 26, 2024

Of course. It is our standard daily build: it starts by creating table models and seeds, then 88 different snapshots, and then continues with lots of incremental models, a few more table models, and finally a number of views.

We have two workspaces; one uses Serverless, the other Pro SQL Warehouses. The problem only occurred in the Pro one. After switching the other workspace from Serverless to Pro, the problem occurred there as well.

@OlofL
Author

OlofL commented Jan 26, 2024

No streaming tables, just a plain batch transformation.

@OlofL
Author

OlofL commented Jan 26, 2024

job.txt
Attached is the job definition YAML with the sensitive data removed. Pinning the dbt-databricks library below 1.7.3 is the workaround we have found to work.

@benc-db
Collaborator

benc-db commented Jan 26, 2024

The combination of things working when switching to serverless or 1.7.3, together with the failures giving up after less than 3 minutes, suggests to me that something has changed in the retry policy in databricks-sql-connector 3.0.1 (which we upgraded to with 1.7.4). Going to work with @susodapop to try to figure out what is different.
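For context, a minimal sketch of how urllib3's retry machinery produces the exact error text in the logs above. This is not the connector's actual code, and the Retry parameters it really uses are assumptions here; httpbin.org/status/503 stands in for a warehouse that is still starting up:

    import urllib3
    from urllib3.util.retry import Retry

    retry_policy = Retry(
        total=3,                  # assumed retry budget
        status_forcelist=[503],   # treat 503 Service Unavailable as retryable
        backoff_factor=0.5,       # exponential backoff between attempts
    )
    http = urllib3.PoolManager(retries=retry_policy)
    try:
        # this endpoint always answers 503
        http.request("GET", "https://httpbin.org/status/503")
    except urllib3.exceptions.MaxRetryError as exc:
        print(exc.reason)  # -> too many 503 error responses

If 3.0.1 shrank the budget or stopped treating 503-during-startup as retryable, that would explain why only cold Pro warehouses are affected.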

@OlofL
Author

OlofL commented Jan 26, 2024

Great! Since we are able to recreate the problem in both our environments I'm happy to help with any further testing if possible.

OlofL closed this as completed Jan 26, 2024
@benc-db
Collaborator

benc-db commented Jan 26, 2024

@OlofL if you can get your dbt artifacts to me from one of the failed runs (either file a ticket with your Databricks contact, or you can email it to me at [email protected]), that would be very helpful.

OlofL reopened this Jan 26, 2024
@OlofL
Author

OlofL commented Jan 26, 2024

I can get it from our second environment, so it is in your mailbox now.

@benc-db
Collaborator

benc-db commented Jan 30, 2024

@OlofL do you see this issue at all with 1.7.3? I'm able to repro the failure this morning with 1.7.3 :/.

@OlofL
Author

OlofL commented Jan 30, 2024

Great that you can replicate it. We have only seen it in 1.7.4, not in 1.7.3.
But to be on the safe side we have pinned to <1.7.3, so our daily build is not running version 1.7.3; it is running an earlier version.
I can try again tomorrow with version 1.7.3 if that helps to make sure it is the same problem.

@andrewpdillon

We are also experiencing the same issue with 1.7.4, but not with 1.7.3. 1.7.4 works OK if the SQL Warehouse is already running, but if the workflow has to start the SQL Warehouse, that is when we encounter the 503 error.

Adding retries to the task is another workaround for us at this point.
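For reference, a sketch of the task-level fields that configure this in the Jobs API 2.1; the field names are real API fields, the values are only examples:

    # Fragment of a Jobs API 2.1 task spec (values are examples only).
    task_fragment = {
        "task_key": "dbt_build",
        "max_retries": 2,                     # re-run the task if it fails
        "min_retry_interval_millis": 60_000,  # give the warehouse a minute to finish starting
        "retry_on_timeout": False,
    }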

@henlue
Contributor

henlue commented Feb 1, 2024

Even after upgrading to 1.7.5, I'm still facing the same bug as described in this issue. I get the exact same error message as above.
1.7.3 is working fine for me.

Does 1.7.5 resolve the issue for you, @OlofL @andrewpdillon?

My full package list:

Package                   Version
------------------------- ----------
agate                     1.7.1
annotated-types           0.6.0 
appdirs                   1.4.4                                      
argcomplete               3.1.6 
attrs                     23.2.0
Babel                     2.14.0
cachetools                5.3.2                                      
certifi                   2023.11.17
cffi                      1.16.0
cfgv                      3.4.0
chardet                   5.2.0 
charset-normalizer        3.3.2 
click                     8.1.7 
colorama                  0.4.6
commitizen                3.12.0
cryptography              42.0.2
databricks-sdk            0.17.0
databricks-sql-connector  3.0.2
dbt-core                  1.7.6
dbt-databricks            1.7.5 
dbt-extractor             0.5.1 
dbt-semantic-interfaces   0.4.3 
dbt-spark                 1.7.1 
decli                     0.6.1
diff_cover                8.0.3 
distlib                   0.3.8  
et-xmlfile                1.1.0  
filelock                  3.13.1
google-auth               2.27.0
identify                  2.5.33
idna                      3.6   
importlib-metadata        6.11.0
iniconfig                 2.0.0
isodate                   0.6.1
jaraco.classes            3.3.0
jeepney                   0.8.0
Jinja2                    3.1.3
jsonschema                4.21.1
jsonschema-specifications 2023.12.1
keyring                   24.3.0
leather                   0.3.4
Logbook                   1.5.3
lz4                       4.3.3
MarkupSafe                2.1.4
mashumaro                 3.12
minimal-snowplow-tracker  0.0.2
more-itertools            10.2.0
msgpack                   1.0.7
networkx                  3.2.1
nodeenv                   1.8.0
numpy                     1.26.3
oauthlib                  3.2.2
openpyxl                  3.1.2
packaging                 23.2
pandas                    2.1.4
parsedatetime             2.6
pathspec                  0.11.2
pip                       23.3.2
platformdirs              4.2.0
pluggy                    1.4.0
pre-commit                2.18.1
prompt-toolkit            3.0.36
protobuf                  4.25.2
pyarrow                   14.0.2
pyasn1                    0.5.1
pyasn1-modules            0.3.0
pycparser                 2.21
pydantic                  2.6.0
pydantic_core             2.16.1
Pygments                  2.17.2
pytest                    8.0.0
python-dateutil           2.8.2
python-slugify            8.0.2
pytimeparse               1.1.8
pytz                      2023.4
PyYAML                    6.0.1
questionary               2.0.1
referencing               0.33.0
regex                     2023.12.25
requests                  2.31.0
rpds-py                   0.17.1
rsa                       4.9
SecretStorage             3.3.3
setuptools                69.0.3
six                       1.16.0
sqlfluff                  2.3.5
sqlparams                 6.0.1
sqlparse                  0.4.4
tblib                     3.0.0
termcolor                 2.4.0
text-unidecode            1.3
thrift                    0.16.0
toml                      0.10.2
tomlkit                   0.12.3
tqdm                      4.66.1
typing_extensions         4.9.0
tzdata                    2023.4
urllib3                   1.26.18
virtualenv                20.25.0
wcwidth                   0.2.13
wheel                     0.42.0
zipp                      3.17.0

@benc-db
Collaborator

benc-db commented Feb 1, 2024

Hi all, we had to pull 1.7.5 as well because it didn't completely fix the problem. If you'd like to help validate our latest attempt, can you try installing from the 1.7.4testing branch?


@benc-db
Collaborator

benc-db commented Feb 1, 2024

@OlofL @andrewpdillon @henlue please let me know if the above (1.7.4testing branch) resolves the issue.

@benc-db
Collaborator

benc-db commented Feb 5, 2024

@OlofL @andrewpdillon @henlue any update? I'm looking for feedback that this branch resolves the issue before I release it, as this circumstance is very hard to repro internally.

@OlofL
Author

OlofL commented Feb 5, 2024

I configured a new workflow with 1.7.4 to first verify that I still had the error, but I didn't get it this time. Is the error dependent on a specific version of dbt? I will try again.

@benc-db
Collaborator

benc-db commented Feb 5, 2024

No, whether you see the error or not depends on how quickly compute is available, hence why reproducing is so hard; it is a function of the cloud provider's compute availability.
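Worth noting: the failing run above gave up after roughly 150 seconds (152.34s), which fits a fixed retry window. A back-of-the-envelope sketch with assumed urllib3-style backoff parameters:

    # sleep before retry n is roughly backoff_factor * 2 ** (n - 1),
    # capped at urllib3's BACKOFF_MAX of 120 s; both parameters are assumptions.
    BACKOFF_MAX = 120

    def retry_window(backoff_factor: float, retries: int) -> float:
        return sum(min(backoff_factor * 2 ** (n - 1), BACKOFF_MAX)
                   for n in range(1, retries + 1))

    # ~126 s: shorter than a cold Pro warehouse start, but plenty for a
    # warm (or serverless) warehouse to answer.
    print(retry_window(backoff_factor=2.0, retries=6))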

@OlofL
Author

OlofL commented Feb 5, 2024

Ok, that explains a lot! I will try to set up some jobs on a timer, but I suspect that the next bottleneck for Azure will be around the first of March; that is typically when Azure gets bogged down.

@benc-db
Collaborator

benc-db commented Feb 5, 2024

Ok, that explains a lot! I will try to set up some jobs on a timer, but I suspect that the next bottleneck for Azure will be around the first of March; that is typically when Azure gets bogged down.

Yeah, for me, I have the best luck reproducing at 4:30 pm PST, and I have no idea why lol.

@OlofL
Author

OlofL commented Feb 5, 2024

That is when we in Europe start all our major workloads, at 01.30 CET.

benc-db mentioned this issue Feb 6, 2024
@lucascajal

I have the same issue but with an older version of dbt: 1.4.9. In my case, I had a scheduled job that was working fine up to the 28th of November last year. Then it suddenly started failing the next day, without any changes on our side (development was on hold).

Here's the log for the 28th execution:

01:00:32  Running with dbt=1.4.9
01:00:32  Found 28 models, 1 test, 0 snapshots, 0 analyses, 498 macros, 0 operations, 0 seed files, 12 sources, 0 exposures, 0 metrics
01:07:03
01:07:03  Concurrency: 4 threads (target='dev')
01:07:03  1 of 17 START sql table model my_schema.dates ...... [RUN]
01:07:03  2 of 17 START sql view model my_schema.table_1  [RUN]
01:07:03  3 of 17 START sql view model my_schema.table_2  [RUN]
01:07:03  4 of 17 START sql view model my_schema.table_3  [RUN]
01:07:16  3 of 17 OK created sql view model my_schema.table_2  [OK in 12.81s]
01:07:16  5 of 17 START sql view model my_schema.table_4  [RUN]
01:07:16  4 of 17 OK created sql view model my_schema.table_3  [OK in 12.92s]
01:07:16  6 of 17 START sql view model my_schema.table_5  [RUN]
01:07:17  2 of 17 OK created sql view model my_schema.table_1  [OK in 13.79s]
01:07:17  7 of 17 START sql view model my_schema.table_6  [RUN]
01:07:24  5 of 17 OK created sql view model my_schema.table_4  [OK in 8.16s]
01:07:24  8 of 17 START sql view model my_schema.table_7  [RUN]
01:07:25  7 of 17 OK created sql view model my_schema.table_6  [OK in 7.83s]
01:07:25  9 of 17 START sql view model my_schema.table_8  [RUN]
01:07:27  6 of 17 OK created sql view model my_schema.table_5  [OK in 11.32s]
01:07:27  10 of 17 START sql view model my_schema.table_9  [RUN]
01:07:33  9 of 17 OK created sql view model my_schema.table_8  [OK in 8.56s]
01:07:33  11 of 17 START sql table model my_schema.table_10 .. [RUN]
01:07:34  10 of 17 OK created sql view model my_schema.table_9  [OK in 6.26s]
01:07:34  12 of 17 START sql view model my_schema.table_11  [RUN]
01:07:34  8 of 17 OK created sql view model my_schema.table_7  [OK in 9.78s]
01:07:34  13 of 17 START sql table model my_schema.table_12  [RUN]
01:07:41  12 of 17 OK created sql view model my_schema.table_11  [OK in 7.78s]
01:07:41  14 of 17 START sql view model my_schema.table_13  [RUN]
01:07:45  1 of 17 OK created sql table model my_schema.dates . [OK in 41.68s]
01:07:46  11 of 17 OK created sql table model my_schema.table_10  [OK in 12.81s]
01:07:49  14 of 17 OK created sql view model my_schema.table_13  [OK in 8.01s]
01:07:51  13 of 17 OK created sql table model my_schema.table_12  [OK in 17.67s]
01:07:51  15 of 17 START sql table model my_schema.orphans ...... [RUN]
01:07:51  16 of 17 START sql table model my_schema.sources_combined  [RUN]
01:08:15  15 of 17 OK created sql table model my_schema.orphans . [OK in 23.43s]
01:08:15  17 of 17 START test accepted_values  [RUN]
01:08:17  17 of 17 PASS accepted_values  [PASS in 1.83s]
01:09:22  16 of 17 OK created sql table model my_schema.sources_combined  [OK in 90.90s]
01:09:23
01:09:23  Finished running 11 view models, 5 table models, 1 test in 0 hours 8 minutes and 50.20 seconds (530.20s).
01:09:23
01:09:23  Completed successfully
01:09:23
01:09:23  Done. PASS=17 WARN=0 ERROR=0 SKIP=0 TOTAL=17

And the 29th, when the error started showing up:

01:00:31  Running with dbt=1.4.9
01:00:31  Found 28 models, 1 test, 0 snapshots, 0 analyses, 498 macros, 0 operations, 0 seed files, 12 sources, 0 exposures, 0 metrics
01:03:06
01:03:06  Finished running  in 0 hours 2 minutes and 34.68 seconds (154.68s).
01:03:06  Encountered an error:
Runtime Error
  HTTPSConnectionPool(host='adb-xxxxxxx.x.azuredatabricks.net', port=443): Max retries exceeded with url: /sql/1.0/warehouses/xxxxxxxx (Caused by ResponseError('too many 503 error responses'))

System info:

  • Running on a dbt cloud job (version 1.4.9)
  • Azure databricks
  • Executing in a SQL endpoint

Hope this helps shed some more light on the issue.

github-merge-queue bot pushed a commit to databricks/cli that referenced this issue Feb 19, 2024
## Changes

This adds a new dbt-sql template. This work requires the new WorkspaceFS
support for dbt tasks.

In this latest revision, I've hidden the new template from the list so we can merge it, iterate over it, and properly release the template at the right time.

Blockers:
- [x] WorkspaceFS support for dbt projects is in prod
- [x] Move dbt files into a subdirectory
- [ ] Wait until the next (>1.7.4) release of the dbt plugin which will
have major improvements!
- _Rather than wait, this template is hidden from the list of
templates._
- [x] SQL extension is preconfigured based on extension settings (if
possible)
- MV / streaming tables:
  - [x] Add to template
- [x] Fix databricks/dbt-databricks#535 (to be released in 1.7.4)
- [x] Merge databricks/dbt-databricks#338 (to be released in 1.7.4)
- [ ] Fix "too many 503 errors" issue
(databricks/dbt-databricks#570, internal
tracker: ES-1009215, ES-1014138)
  - [x] Support ANSI mode in the template
- [ ] Streaming tables support is either ungated or the template
provides instructions about signup
- _Mitigation for now: this template is hidden from the list of
templates._
- [x] Support non-workspace-admin deployment
- [x] Make sure `data_security_mode: SINGLE_USER` works on non-UC
workspaces (it's required to be explicitly specified on UC workspaces
with single-node clusters)
- [x] Support non-UC workspaces

## Tests

- [x] Unit tests
- [x] Manual testing
- [x] More manual testing
- [ ] Reviewer manual testing
  - _I'd like to do a small bug bash post-merging._