
Databricks workflow with SQL Warehouse Pro crashes with HTTP error 503 with version 1.7.4 #570

Closed
OlofL opened this issue Jan 26, 2024 · 21 comments · Fixed by #578
Labels
bug Something isn't working

Comments

@OlofL

OlofL commented Jan 26, 2024

Describe the bug

A dbt job in a workflow terminates with the error Max retries exceeded with url: ... (Caused by ResponseError('too many 503 error responses')) just as it starts sending SQL commands to the cluster.
This occurs only in connector version 1.7.4, not in version 1.7.3.
This occurs only with a Pro SQL Warehouse, not with a Serverless SQL Warehouse.

Steps To Reproduce

The problem occurs in a workflow in a Databricks workspace with the following settings:
Running on Azure, Databricks Premium, no Unity Catalog
Job cluster: single node Standard_DS3_v2
Work cluster: SQL Warehouse Pro X-Small, Cluster count: Active 0 Min 1 Max 1, Channel: Current, Cost optimized
Git source: Azure DevOps
Library version setting: dbt-databricks>=1.0.0,<2.0.0

Start the workflow. After the job cluster has been created and the SQL Warehouse has been started, an error is shown in the log:

  • dbt build --profiles-dir ../misc/misc/ -t prod -f
    02:09:19 Running with dbt=1.7.5
    02:09:20 Registered adapter: databricks=1.7.4
    02:09:20 Unable to do partial parsing because saved manifest not found. Starting full parse.
    02:09:38 Found 393 models, 88 snapshots, 1 analysis, 5 seeds, 1443 tests, 109 sources, 8 exposures, 0 metrics, 920 macros, 0 groups, 0 semantic models
    02:09:38
    02:12:11
    02:12:11 Finished running in 0 hours 2 minutes and 32.34 seconds (152.34s).
    02:12:11 Encountered an error:
    Runtime Error
    HTTPSConnectionPool(host='[removed]', port=443): Max retries exceeded with url: /sql/1.0/warehouses/[removed] (Caused by ResponseError('too many 503 error responses'))

Changing to a SQL Warehouse of type Serverless, X-Small, Cluster count: Active 0 Min 1 Max 1, Current solves the problem.
Running an older version of the dbt-databricks library (dbt-databricks>=1.0.0,<1.7.3) also solves the problem.
The problem appeared on Jan 25 when the version was updated from 1.7.3 to 1.7.4.
I have not been able to reproduce the problem when running dbt interactively, only as a workflow, but that may be my mistake; these have been a stressful couple of days because of this.
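For anyone who wants to probe this outside of dbt, here is a minimal sketch (not dbt's own code) that talks to the warehouse directly with databricks-sql-connector, the library dbt-databricks uses under the hood; hostname, HTTP path, and token are placeholders:

    # Minimal sketch: connect to a *stopped* Pro warehouse and see whether
    # the retry budget runs out before the warehouse finishes starting.
    # All connection values below are placeholders.
    from databricks import sql

    with sql.connect(
        server_hostname="adb-xxxxxxx.x.azuredatabricks.net",  # placeholder
        http_path="/sql/1.0/warehouses/xxxxxxxx",             # placeholder
        access_token="dapi-REDACTED",                         # placeholder
    ) as conn:
        with conn.cursor() as cursor:
            cursor.execute("SELECT 1")
            print(cursor.fetchone())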

Expected behavior

Log output:
02:08:27 Running with dbt=1.7.5
02:08:28 Registered adapter: databricks=1.7.3
02:08:28 Unable to do partial parsing because saved manifest not found. Starting full parse.
02:08:46 Found 393 models, 88 snapshots, 1 analysis, 5 seeds, 1443 tests, 109 sources, 8 exposures, 0 metrics, 919 macros, 0 groups, 0 semantic models
02:08:46
02:14:30 Concurrency: 12 threads (target='prod')
02:14:30
02:14:31 1 of 1835 START sql table model staging.rollup12helper ......................... [RUN]
02:14:31 2 of 1835 START sql table model staging.rollup24helper ......................... [RUN]

Screenshots and log output

N/A

System information

dbt 1.7.5
dbt-databricks 1.7.4

Configured in a Databricks workflow (dbt task), so not much control over settings there.

Additional context

N/A

OlofL added the bug (Something isn't working) label Jan 26, 2024
@lennartkats-db
Contributor

Thanks for the report, @OlofL. Could you elaborate a bit on what kind of workload you're running? Is your model generating a regular table, streaming table, materialized view, or something else?

@OlofL
Author

OlofL commented Jan 26, 2024

Of course. It is our standard daily build: it starts by creating table models and seeds, then 88 different snapshots, and then continues with lots of incremental models, a few more table models, and finally a number of views.

We have two workspaces; one uses Serverless, the other Pro SQL Warehouses. The problem only occurred in the Pro one. After switching the other workspace from Serverless to Pro, the problem occurred there as well.

@OlofL
Author

OlofL commented Jan 26, 2024

No streaming tables, just a plain batch transformation.

@OlofL
Author

OlofL commented Jan 26, 2024

job.txt
Attached is the job definition YAML with the sensitive data removed. Pinning the dbt-databricks library below 1.7.3 is the workaround we have found to work.

@benc-db
Collaborator

benc-db commented Jan 26, 2024

The combination of things working when switching to serverless or 1.7.3, together with the failures giving up after less than 3 minutes, suggests to me that something has changed in the retry policy in databricks-sql-connector 3.0.1 (which we upgraded to with 1.7.4). Going to work with @susodapop to try to figure out what is different.
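For context, a minimal sketch of how urllib3's retry machinery produces the exact error text in the logs above. This is not the connector's actual code, and the Retry parameters it really uses are assumptions here; httpbin.org/status/503 stands in for a warehouse that is still starting up:

    import urllib3
    from urllib3.util.retry import Retry

    retry_policy = Retry(
        total=3,                  # assumed retry budget
        status_forcelist=[503],   # treat 503 Service Unavailable as retryable
        backoff_factor=0.5,       # exponential backoff between attempts
    )
    http = urllib3.PoolManager(retries=retry_policy)
    try:
        # this endpoint always answers 503
        http.request("GET", "https://httpbin.org/status/503")
    except urllib3.exceptions.MaxRetryError as exc:
        print(exc.reason)  # -> too many 503 error responses

If 3.0.1 shrank the budget or stopped treating 503-during-startup as retryable, that would explain why only cold Pro warehouses are affected.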

@OlofL
Author

OlofL commented Jan 26, 2024

Great! Since we are able to recreate the problem in both our environments I'm happy to help with any further testing if possible.

OlofL closed this as completed Jan 26, 2024
@benc-db
Collaborator

benc-db commented Jan 26, 2024

@OlofL if you can get your dbt artifacts to me from one of the failed runs (either file a ticket with your Databricks contact, or you can email it to me at [email protected]), that would be very helpful.

OlofL reopened this Jan 26, 2024
@OlofL
Author

OlofL commented Jan 26, 2024

I can get it from our second environment, so it is in your mailbox now.

@benc-db
Collaborator

benc-db commented Jan 30, 2024

@OlofL do you see this issue at all with 1.7.3? I'm able to repro the failure this morning with 1.7.3 :/.

@OlofL
Author

OlofL commented Jan 30, 2024

Great that you can replicate it. We have only seen it in 1.7.4, not in 1.7.3.
But to be on the safe side we have pinned to <1.7.3, so our daily build is not running version 1.7.3; it is running an earlier version.
I can try again tomorrow with version 1.7.3 if that helps to make sure it is the same problem.

@andrewpdillon

We are also experiencing the same issue with 1.7.4, but not with 1.7.3. 1.7.4 works OK if the SQL Warehouse is already running, but if the workflow has to start the SQL Warehouse, that is when we encounter the 503 error.

Adding retries to the task is another workaround for us at this point.
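For reference, a sketch of the task-level fields that configure this in the Jobs API 2.1; the field names are real API fields, the values are only examples:

    # Fragment of a Jobs API 2.1 task spec (values are examples only).
    task_fragment = {
        "task_key": "dbt_build",
        "max_retries": 2,                     # re-run the task if it fails
        "min_retry_interval_millis": 60_000,  # give the warehouse a minute to finish starting
        "retry_on_timeout": False,
    }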

@henlue
Contributor

henlue commented Feb 1, 2024

Even after upgrading to 1.7.5, I'm still facing the same bug as described in this issue. I get the exact same error message as above.
1.7.3 is working fine for me.

Does 1.7.5 resolve the issue for you, @OlofL @andrewpdillon?

My full package list:

Package                   Version
------------------------- ----------
agate                     1.7.1
annotated-types           0.6.0 
appdirs                   1.4.4                                      
argcomplete               3.1.6 
attrs                     23.2.0
Babel                     2.14.0
cachetools                5.3.2                                      
certifi                   2023.11.17
cffi                      1.16.0
cfgv                      3.4.0
chardet                   5.2.0 
charset-normalizer        3.3.2 
click                     8.1.7 
colorama                  0.4.6
commitizen                3.12.0
cryptography              42.0.2
databricks-sdk            0.17.0
databricks-sql-connector  3.0.2
dbt-core                  1.7.6
dbt-databricks            1.7.5 
dbt-extractor             0.5.1 
dbt-semantic-interfaces   0.4.3 
dbt-spark                 1.7.1 
decli                     0.6.1
diff_cover                8.0.3 
distlib                   0.3.8  
et-xmlfile                1.1.0  
filelock                  3.13.1
google-auth               2.27.0
identify                  2.5.33
idna                      3.6   
importlib-metadata        6.11.0
iniconfig                 2.0.0
isodate                   0.6.1
jaraco.classes            3.3.0
jeepney                   0.8.0
Jinja2                    3.1.3
jsonschema                4.21.1
jsonschema-specifications 2023.12.1
keyring                   24.3.0
leather                   0.3.4
Logbook                   1.5.3
lz4                       4.3.3
MarkupSafe                2.1.4
mashumaro                 3.12
minimal-snowplow-tracker  0.0.2
more-itertools            10.2.0
msgpack                   1.0.7
networkx                  3.2.1
nodeenv                   1.8.0
numpy                     1.26.3
oauthlib                  3.2.2
openpyxl                  3.1.2
packaging                 23.2
pandas                    2.1.4
parsedatetime             2.6
pathspec                  0.11.2
pip                       23.3.2
platformdirs              4.2.0
pluggy                    1.4.0
pre-commit                2.18.1
prompt-toolkit            3.0.36
protobuf                  4.25.2
pyarrow                   14.0.2
pyasn1                    0.5.1
pyasn1-modules            0.3.0
pycparser                 2.21
pydantic                  2.6.0
pydantic_core             2.16.1
Pygments                  2.17.2
pytest                    8.0.0
python-dateutil           2.8.2
python-slugify            8.0.2
pytimeparse               1.1.8
pytz                      2023.4
PyYAML                    6.0.1
questionary               2.0.1
referencing               0.33.0
regex                     2023.12.25
requests                  2.31.0
rpds-py                   0.17.1
rsa                       4.9
SecretStorage             3.3.3
setuptools                69.0.3
six                       1.16.0
sqlfluff                  2.3.5
sqlparams                 6.0.1
sqlparse                  0.4.4
tblib                     3.0.0
termcolor                 2.4.0
text-unidecode            1.3
thrift                    0.16.0
toml                      0.10.2
tomlkit                   0.12.3
tqdm                      4.66.1
typing_extensions         4.9.0
tzdata                    2023.4
urllib3                   1.26.18
virtualenv                20.25.0
wcwidth                   0.2.13
wheel                     0.42.0
zipp                      3.17.0

@benc-db
Collaborator

benc-db commented Feb 1, 2024

Hi all, we had to pull 1.7.5 as well because it didn't completely fix the problem. If you'd like to help validate our latest attempt, can you try installing from the 1.7.4testing branch?


@benc-db
Collaborator

benc-db commented Feb 1, 2024

@OlofL @andrewpdillon @henlue please let me know if the above (1.7.4testing branch) resolves the issue.

@benc-db
Collaborator

benc-db commented Feb 5, 2024

@OlofL @andrewpdillon @henlue any update? I'm looking for feedback that this branch resolves the issue before I release it, as this circumstance is very hard to repro internally.

@OlofL
Author

OlofL commented Feb 5, 2024

I configured a new workflow with 1.7.4 to first verify that I still had the error, but I didn't get it this time. Is the error dependent on a specific version of dbt? I will try again.

@benc-db
Collaborator

benc-db commented Feb 5, 2024

No, whether you see the error or not depends on how quickly compute is available, hence why reproducing is so hard; it is a function of the cloud provider's compute availability.
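Worth noting: the failing run above gave up after roughly 150 seconds (152.34s), which fits a fixed retry window. A back-of-the-envelope sketch with assumed urllib3-style backoff parameters:

    # sleep before retry n is roughly backoff_factor * 2 ** (n - 1),
    # capped at urllib3's BACKOFF_MAX of 120 s; both parameters are assumptions.
    BACKOFF_MAX = 120

    def retry_window(backoff_factor: float, retries: int) -> float:
        return sum(min(backoff_factor * 2 ** (n - 1), BACKOFF_MAX)
                   for n in range(1, retries + 1))

    # ~126 s: shorter than a cold Pro warehouse start, but plenty for a
    # warm (or serverless) warehouse to answer.
    print(retry_window(backoff_factor=2.0, retries=6))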

@OlofL
Author

OlofL commented Feb 5, 2024

Ok, that explains a lot! I will try to set up some jobs on a timer, but I suspect that the next bottleneck for Azure will be around the first of March; that is typically when Azure gets bogged down.

@benc-db
Collaborator

benc-db commented Feb 5, 2024

Ok, that explains a lot! I will try to set up some jobs on a timer, but I suspect that the next bottleneck for Azure will be around the first of March; that is typically when Azure gets bogged down.

Yeah, for me, I have the best luck reproducing at 4:30 pm PST, and I have no idea why lol.

@OlofL
Author

OlofL commented Feb 5, 2024

That is when we in Europe start all our major workloads, at 01.30 CET.

benc-db mentioned this issue Feb 6, 2024
@lucascajal

I have the same issue but with an older version of dbt: 1.4.9. In my case, I had a scheduled job that was working fine up to the 28th of November last year. Then it suddenly started failing the next day, without any changes on our side (development was on hold).

Here's the log for the 28th execution:

01:00:32  Running with dbt=1.4.9
01:00:32  Found 28 models, 1 test, 0 snapshots, 0 analyses, 498 macros, 0 operations, 0 seed files, 12 sources, 0 exposures, 0 metrics
01:07:03
01:07:03  Concurrency: 4 threads (target='dev')
01:07:03  1 of 17 START sql table model my_schema.dates ...... [RUN]
01:07:03  2 of 17 START sql view model my_schema.table_1  [RUN]
01:07:03  3 of 17 START sql view model my_schema.table_2  [RUN]
01:07:03  4 of 17 START sql view model my_schema.table_3  [RUN]
01:07:16  3 of 17 OK created sql view model my_schema.table_2  [OK in 12.81s]
01:07:16  5 of 17 START sql view model my_schema.table_4  [RUN]
01:07:16  4 of 17 OK created sql view model my_schema.table_3  [OK in 12.92s]
01:07:16  6 of 17 START sql view model my_schema.table_5  [RUN]
01:07:17  2 of 17 OK created sql view model my_schema.table_1  [OK in 13.79s]
01:07:17  7 of 17 START sql view model my_schema.table_6  [RUN]
01:07:24  5 of 17 OK created sql view model my_schema.table_4  [OK in 8.16s]
01:07:24  8 of 17 START sql view model my_schema.table_7  [RUN]
01:07:25  7 of 17 OK created sql view model my_schema.table_6  [OK in 7.83s]
01:07:25  9 of 17 START sql view model my_schema.table_8  [RUN]
01:07:27  6 of 17 OK created sql view model my_schema.table_5  [OK in 11.32s]
01:07:27  10 of 17 START sql view model my_schema.table_9  [RUN]
01:07:33  9 of 17 OK created sql view model my_schema.table_8  [OK in 8.56s]
01:07:33  11 of 17 START sql table model my_schema.table_10 .. [RUN]
01:07:34  10 of 17 OK created sql view model my_schema.table_9  [OK in 6.26s]
01:07:34  12 of 17 START sql view model my_schema.table_11  [RUN]
01:07:34  8 of 17 OK created sql view model my_schema.table_7  [OK in 9.78s]
01:07:34  13 of 17 START sql table model my_schema.table_12  [RUN]
01:07:41  12 of 17 OK created sql view model my_schema.table_11  [OK in 7.78s]
01:07:41  14 of 17 START sql view model my_schema.table_13  [RUN]
01:07:45  1 of 17 OK created sql table model my_schema.dates . [OK in 41.68s]
01:07:46  11 of 17 OK created sql table model my_schema.table_10  [OK in 12.81s]
01:07:49  14 of 17 OK created sql view model my_schema.table_13  [OK in 8.01s]
01:07:51  13 of 17 OK created sql table model my_schema.table_12  [OK in 17.67s]
01:07:51  15 of 17 START sql table model my_schema.orphans ...... [RUN]
01:07:51  16 of 17 START sql table model my_schema.sources_combined  [RUN]
01:08:15  15 of 17 OK created sql table model my_schema.orphans . [OK in 23.43s]
01:08:15  17 of 17 START test accepted_values  [RUN]
01:08:17  17 of 17 PASS accepted_values  [PASS in 1.83s]
01:09:22  16 of 17 OK created sql table model my_schema.sources_combined  [OK in 90.90s]
01:09:23
01:09:23  Finished running 11 view models, 5 table models, 1 test in 0 hours 8 minutes and 50.20 seconds (530.20s).
01:09:23
01:09:23  Completed successfully
01:09:23
01:09:23  Done. PASS=17 WARN=0 ERROR=0 SKIP=0 TOTAL=17

And the 29th, when the error started showing up:

01:00:31  Running with dbt=1.4.9
01:00:31  Found 28 models, 1 test, 0 snapshots, 0 analyses, 498 macros, 0 operations, 0 seed files, 12 sources, 0 exposures, 0 metrics
01:03:06
01:03:06  Finished running  in 0 hours 2 minutes and 34.68 seconds (154.68s).
01:03:06  Encountered an error:
Runtime Error
  HTTPSConnectionPool(host='adb-xxxxxxx.x.azuredatabricks.net', port=443): Max retries exceeded with url: /sql/1.0/warehouses/xxxxxxxx (Caused by ResponseError('too many 503 error responses'))

System info:

  • Running on a dbt cloud job (version 1.4.9)
  • Azure databricks
  • Executing in a SQL endpoint

Hope this helps shed some more light on the issue.

github-merge-queue bot pushed a commit to databricks/cli that referenced this issue Feb 19, 2024
## Changes

This adds a new dbt-sql template. This work requires the new WorkspaceFS
support for dbt tasks.

In this latest revision, I've hidden the new template from the list so we can merge it, iterate over it, and properly release the template at the right time.

Blockers:
- [x] WorkspaceFS support for dbt projects is in prod
- [x] Move dbt files into a subdirectory
- [ ] Wait until the next (>1.7.4) release of the dbt plugin which will
have major improvements!
- _Rather than wait, this template is hidden from the list of
templates._
- [x] SQL extension is preconfigured based on extension settings (if
possible)
- MV / streaming tables:
  - [x] Add to template
- [x] Fix databricks/dbt-databricks#535 (to be released in 1.7.4)
- [x] Merge databricks/dbt-databricks#338 (to be released in 1.7.4)
- [ ] Fix "too many 503 errors" issue
(databricks/dbt-databricks#570, internal
tracker: ES-1009215, ES-1014138)
  - [x] Support ANSI mode in the template
- [ ] Streaming tables support is either ungated or the template
provides instructions about signup
- _Mitigation for now: this template is hidden from the list of
templates._
- [x] Support non-workspace-admin deployment
- [x] Make sure `data_security_mode: SINGLE_USER` works on non-UC
workspaces (it's required to be explicitly specified on UC workspaces
with single-node clusters)
- [x] Support non-UC workspaces

## Tests

- [x] Unit tests
- [x] Manual testing
- [x] More manual testing
- [ ] Reviewer manual testing
  - _I'd like to do a small bug bash post-merging._