
[CT-2243] [Regression] dbt seed doesn't work on certain cases (large row counts) #347

Closed
elongl opened this issue Mar 1, 2023 · 21 comments · Fixed by #601
Labels
help_wanted (Extra attention is needed) · type:enhancement (New feature or request)

Comments

@elongl

elongl commented Mar 1, 2023

Is this a regression in a recent version of dbt-redshift?

  • I believe this is a regression in dbt-redshift functionality
  • I have searched the existing issues, and I could not find an existing issue for this regression

Current Behavior

I'm not 100% sure when it happens, but I'm unable to seed CSVs that I was previously able to seed.
I'm getting the following error:

17:00:52  Finished running 14 seeds, 2 hooks in 0 hours 1 minutes and 22.77 seconds (82.77s).
17:00:52
17:00:52  Completed with 3 errors and 0 warnings:
17:00:52
17:00:52  Runtime Error in seed numeric_column_anomalies_training (data/training/numeric_column_anomalies_training.csv)
17:00:52    'h' format requires -32768 <= number <= 32767
17:00:52
17:00:52  Runtime Error in seed dimension_anomalies_training (data/training/dimension_anomalies_training.csv)
17:00:52    'h' format requires -32768 <= number <= 32767
17:00:52
17:00:52  Runtime Error in seed any_type_column_anomalies_training (data/training/any_type_column_anomalies_training.csv)
17:00:52    'h' format requires -32768 <= number <= 32767
17:00:52
17:00:52  Done. PASS=11 WARN=0 ERROR=3 SKIP=0 TOTAL=14

Expected/Previous Behavior

I should be able to seed them.

Steps To Reproduce

  1. Load this file to your seeds directory.
  2. Run dbt seed.

Here's another file if needed.

Relevant log output

No response

Environment

- OS: macOS
- Python: 3.9.6
- dbt-core (working version): 1.4.0
- dbt-redshift (working version): 1.4.0
- dbt-core (regression version): 1.5.0b2
- dbt-redshift (regression version): 1.5.0b1

Additional Context

No response

@elongl elongl added type:bug Something isn't working triage:product labels Mar 1, 2023
@github-actions github-actions bot changed the title [Regression] dbt seed doesn't work on certain cases. [CT-2243] [Regression] dbt seed doesn't work on certain cases. Mar 1, 2023
@dbeatty10 dbeatty10 self-assigned this Mar 1, 2023
@dbeatty10
Contributor

Thanks for discovering and reporting this @elongl 🏆

I was able to confirm this as a regression introduced in dbt-redshift 1.5.0b1. My guess is that it has something to do with us using a new database driver: we migrated from psycopg2 to redshift-connector in #251.

It appears to be something related to the length of the seed file rather than the content itself.

e.g., the dimension_anomalies_training.csv you provided gave the exception you reported, but once I split the same content into files of at most 3000 lines each, it worked.

I don't think it's purely the number of lines, since numeric_column_anomalies_training.csv needs to be split into files with fewer lines than dimension_anomalies_training.csv does. So it's probably something to do with the underlying number of bytes instead.

@dbeatty10 dbeatty10 removed their assignment Mar 1, 2023
@dbeatty10
Contributor

A workaround you can use:

  • set the max batch size for seeds within your project

macros/get_batch_size.sql

{% macro get_batch_size() %}
  {{ return(1000) }}
{% endmacro %}
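For context: get_batch_size() is the macro that dbt-core's default seed materialization uses when chunking seed rows into INSERT statements, so overriding it in your project (as above) caps how many rows get bound into any single statement. The arithmetic behind why a smaller batch helps is sketched further down the thread.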

@Fleid
Contributor

Fleid commented Mar 1, 2023

@dbeatty10, do you think that it's just the SQL statement that's too long?
Looking at the log:

21:39:05.373543 [debug] [Thread-1 (]: Redshift adapter: Error running SQL: insert into "ci"."DBT_fleid_2243"."dimension_anomalies_training" ("updated_at", "platform", "version", "user_id") values
(%s,%s,%s,%s),(%s,%s,%s,%s),(%s,%s,%s,%s)...
...
(%s,%s,%s,%s),(%s,%s,%s,%s),(%s,%s,%s,%s)
21:39:05.373706 [debug] [Thread-1 (]: Redshift adapter: Rolling back transaction.
21:39:05.373798 [debug] [Thread-1 (]: On seed.ct_2243.dimension_anomalies_training: ROLLBACK
21:39:05.554898 [debug] [Thread-1 (]: Timing info for seed.ct_2243.dimension_anomalies_training (execute): 2023-03-01 21:38:57.320592 => 2023-03-01 21:39:05.554739
21:39:05.555464 [debug] [Thread-1 (]: On seed.ct_2243.dimension_anomalies_training: Close
21:39:05.586023 [debug] [Thread-1 (]: Runtime Error in seed dimension_anomalies_training (seeds/dimension_anomalies_training.csv)
  'h' format requires -32768 <= number <= 32767

That last error, 'h' format requires -32768 <= number <= 32767, still feels like a data value issue to me. Could it be a specific row that's fishy towards the end of the file? Nah, I tried replacing values, and the same error occurs. So that smells like something within the new Redshift connector library.

@sathiish-kumar does that ring a bell to you at all?

@Fleid Fleid self-assigned this Mar 1, 2023
@sathiish-kumar
Contributor

@Fleid I'm confirming with my colleague who manages the Python driver whether she has seen this before. Without knowing too much about what the exact query was, one guess is that the data type being used is incorrect (they appear to be using a string instead of a number type?). Apologies if this is already a foregone conclusion.

@elongl
Author

elongl commented Mar 1, 2023

You can take inspiration from how we insert rows if you're interested.

@dbeatty10
Contributor

@Fleid yes, I think the statement is too "long", and it just depends on what "too long" actually means 🤔

This thread has an interesting comment:

Just discovered another limit ... you can only insert up to 32,768 (short/2byte) parameters into a SQL statement. Something you can definitely hit before 16MB, if you have small columns.
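
To make that concrete: a multi-row INSERT binds roughly rows × columns parameters, so the largest safe batch depends on how wide the seed is. A minimal sketch of that arithmetic in Python (assuming the 32,767-parameter ceiling quoted above; this is not dbt or redshift-connector code):

# Estimate the largest seed batch that stays under the per-statement
# bind-parameter ceiling (assumed to be 32,767, per the quote above).
MAX_BIND_PARAMS = 32_767

def max_rows_per_batch(num_columns: int) -> int:
    # Each inserted row binds one parameter per column, so
    # rows * columns must stay at or below the ceiling.
    return MAX_BIND_PARAMS // num_columns

print(max_rows_per_batch(4))   # 8191 -> a 4-column seed overflows above ~8k rows per statement
print(max_rows_per_batch(40))  # 819  -> wider seeds overflow much sooner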

@Fleid
Contributor

Fleid commented Mar 2, 2023

@sathiish-kumar it really looks like the statement is just too long now, but it wasn't before.

What's surprising to me though, @dbeatty10, is this: was the previous connection library working some magic behind the scenes to split that up? Or did some default option change? That's weird.

@dbeatty10
Contributor

Yeah, I'm not sure how psycopg2 was handling this differently than redshift-connector 🤷

@Fleid
Contributor

Fleid commented Mar 13, 2023

After discussion with the Redshift team, there isn't anything that can be done at the library level. If we want to restore the behavior, we will need to add new functionality in dbt-redshift.

Short term, there is nothing to be done except:

  • staying on the older library (dbt-redshift 1.4)
  • splitting the seeds into files of 5 to 10K rows each and union-ing them post-ingestion via a model

Medium term, I was thinking of closing this issue as won't-fix and opening a new issue asking for help with an enhancement. But I'd rather just re-tag this one if nobody minds.

@Fleid Fleid added type:enhancement New feature or request help_wanted Extra attention is needed and removed type:bug Something isn't working labels Mar 13, 2023
@Fleid Fleid changed the title [CT-2243] [Regression] dbt seed doesn't work on certain cases. [CT-2243] [Regression] dbt seed doesn't work on certain cases (large row counts) Mar 13, 2023
@elongl
Author

elongl commented Mar 14, 2023

@Fleid Care to elaborate a bit on why this is a won't-fix?
I think it's a pretty basic feature that dbt offers, and it's not going to work on one of the most popular adapters.

@dbeatty10
Contributor

@Fleid converted this from a bug into an enhancement / feature request, and also labeled it as help_wanted in case a motivated community member wants to take it on. He did these conversions instead of closing this issue as wontfix and opening a brand new one.

I didn't deeply consider how you insert rows @elongl, but maybe something similar might be a good path forward here.

@Fleid
Contributor

Fleid commented Mar 14, 2023

Thanks @dbeatty10 :)

@elongl this is an imperfect but honest take on the topic:

  • Seeds are not built to handle "large" volumes of data. The fact that large row counts could be loaded via dbt-redshift in the first place is, to me, a nice side effect more than a core capability. This is my personal opinion, and I recognize that it's a bit spicy
  • We want the new library for connecting to Redshift much more than we want to avoid breaking large-row-count seeds. The connector unlocks a lot of scenarios, and using it removes long-term tech debt. Plus there are workarounds for large seeds
  • Restoring the lost capability requires development time, which is weighed against all the other work that needs to happen in dbt-redshift and the other adapters we support. It's low priority against the rest of that work. The mechanism for us, then, is to tag the work as help_wanted and hope a contributor will be able to help

I recognize that this is not a satisfactory answer. Something worked in 1.4, and we know it won't in 1.5. I'm sorry we can't do better.

@misteliy

I comprehend the intention to prevent the misuse of dbt seed. However, if a feature was functioning correctly in version 1.4 but suddenly stops working, it is undoubtedly a bug. Consequently, the response from the redshift team is not truly satisfactory.

@dbeatty10
Contributor

@misteliy we totally understand your perspective on this.

Regardless of how this is labeled (bug vs. enhancement), these three options are the best we can offer at this time:

  1. workaround using get_batch_size
  2. label this issue as help_wanted with the hope that an open-source contributor will raise a pull request that we can review
  3. continue using dbt-redshift 1.4.x until one of the above works

@jaklan

jaklan commented May 19, 2023

@dbeatty10 @Fleid even if we agree that loading "large" files is not the right use-case for dbt seed, we just can't say "it's not a bug, it's a feature" and move on, for plenty of reasons:

  1. This issue breaks existing dbt projects without any prior notice.
  2. Such a decision should be taken at the dbt-core level, not dbt-redshift, be documented as a breaking change, and be consistent across the various adapters.
  3. Proper logic should be handled on the dbt side (e.g. rejecting seeds above some file size), not by some random error with no obvious meaning.
  4. The "right usage" boundaries should be clearly defined; right now we don't even know what a "large" file means, because of the above point.

Personally, I disagree with labelling such significant regressions with the help_wanted label and just shifting the responsibility onto the community, because such an attitude lowers trust in the project's stability and maturity. dbt-core is of course an open-source library, but dbt Labs itself is not a non-profit company, and this way you also affect your actual customers. I strongly believe any bugs that break existing projects should be treated with high priority by the dbt team.

Having said that, of course we are going to play with the workaround, but I really wanted to highlight that it shouldn't be the way to go.

By the way, what does help_wanted actually mean at the moment? Would you accept a fix, or would you rather accept making this bug a proper feature? It's not really clear what your direction is here.

@misteliy

misteliy commented May 20, 2023

I would highly encourage the AWS team to look into psycopg2 and how it magically solved the problem of "large" seeds. To me this is clearly a driver issue, because it worked with the previous version. So maybe you can share the technical response from the AWS team detailing how psycopg2 solves it and why it's not possible with the AWS-native Redshift driver. Maybe it would be enough to have a parameter for the default batch size (like the macro workaround you shared) as part of the AWS library?

@Fleid
Contributor

Fleid commented May 24, 2023

@jaklan your points are entirely valid. I 100% agree that the proper way to address this issue is by keeping it as a bug, and either fixing it or handling the new behavior in a more elegant way (like the ones you described). I had hoped that my initial comment reflected that point of view, but now I'm realizing I may not have been clear on that.

But I need to be realistic. We don't have the capacity to do that at the moment. I'd rather flag it to the community as an area where we need help than let it rot in my backlog, because that is what would happen right now. It's a hard decision, but between seed size and any of the other open regressions (% being rejected where it shouldn't, timeouts being shortened to uselessness, transactions not being committed, or SSL support being dropped...), I'm prioritizing the latter. That is because either the impact is wider/deeper or there is no workaround yet.

When the rest of these regressions are handled, and if this issue is not resolved by the community in the meantime, we can certainly re-visit this decision. I'm sorry I don't have a better answer here, and I hope you understand.

@dbeatty10
Contributor

Workaround

We lowered the default batch size for seeds to try to help mitigate this issue (#468).

Since the workaround doesn't address the root cause, we are leaving this issue open in the meantime.

Root cause

redshift_connector is a descendant of pg8000, and the root cause is discussed in tlocke/pg8000#120 (comment).
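
For what it's worth, the error string itself comes from Python's struct module: the wire protocol encodes the bind-parameter count as a signed 16-bit ('h') value, so packing a count above 32,767 raises exactly the message seen in the failed seeds. A minimal reproduction (not the driver's actual code path, just the same struct call):

import struct

# Packing a bind-parameter count above 32,767 as a signed 16-bit value
# reproduces the error message from the failed seed runs.
num_params = 4 * 10_000  # e.g. a 4-column seed sent as one 10,000-row batch

try:
    struct.pack("!h", num_params)
except struct.error as exc:
    print(exc)  # 'h' format requires -32768 <= number <= 32767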

@dbeatty10
Contributor

A solution may involve implementing setinputsizes.

pg8000's implementation could be used as an inspiration.

@dataders
Contributor

dataders commented Jun 6, 2023

just opened aws/amazon-redshift-python-driver#165 to see if we can have the issue resolved upstream. thanks everyone for your help getting to the bottom of this!
