
[CT-2243] [Regression] dbt seed doesn't work on certain cases (large row counts) #347

Closed
elongl opened this issue Mar 1, 2023 · 21 comments · Fixed by #601
Labels
help_wanted (Extra attention is needed) · type:enhancement (New feature or request)

Comments

@elongl

elongl commented Mar 1, 2023

Is this a regression in a recent version of dbt-redshift?

  • I believe this is a regression in dbt-redshift functionality
  • I have searched the existing issues, and I could not find an existing issue for this regression

Current Behavior

I'm not 100% sure when it happens, but I'm unable to seed CSVs that I was previously able to seed.
I'm getting the following error:

17:00:52  Finished running 14 seeds, 2 hooks in 0 hours 1 minutes and 22.77 seconds (82.77s).
17:00:52
17:00:52  Completed with 3 errors and 0 warnings:
17:00:52
17:00:52  Runtime Error in seed numeric_column_anomalies_training (data/training/numeric_column_anomalies_training.csv)
17:00:52    'h' format requires -32768 <= number <= 32767
17:00:52
17:00:52  Runtime Error in seed dimension_anomalies_training (data/training/dimension_anomalies_training.csv)
17:00:52    'h' format requires -32768 <= number <= 32767
17:00:52
17:00:52  Runtime Error in seed any_type_column_anomalies_training (data/training/any_type_column_anomalies_training.csv)
17:00:52    'h' format requires -32768 <= number <= 32767
17:00:52
17:00:52  Done. PASS=11 WARN=0 ERROR=3 SKIP=0 TOTAL=14

Expected/Previous Behavior

I should be able to seed them.

Steps To Reproduce

  1. Load this file to your seeds directory.
  2. Run dbt seed.

Here's another file if needed.

Relevant log output

No response

Environment

- OS: macOS
- Python: 3.9.6
- dbt-core (working version): 1.4.0
- dbt-redshift (working version): 1.4.0
- dbt-core (regression version): 1.5.0b2
- dbt-redshift (regression version): 1.5.0b1

Additional Context

No response

@elongl elongl added type:bug Something isn't working triage:product labels Mar 1, 2023
@github-actions github-actions bot changed the title [Regression] dbt seed doesn't work on certain cases. [CT-2243] [Regression] dbt seed doesn't work on certain cases. Mar 1, 2023
@dbeatty10 dbeatty10 self-assigned this Mar 1, 2023
@dbeatty10
Contributor

Thanks for discovering and reporting this @elongl 🏆

I was able to confirm this as a regression introduced in dbt-redshift 1.5.0b1. My guess is that it has something to do with us using a new database driver: we migrated from psycopg2 to redshift-connector in #251.

It appears to be something related to the length of the seed file rather than the content itself.

e.g., the dimension_anomalies_training.csv you provided gave the exception you reported, but once I split the same content into files of at most 3000 lines each, it worked.

I don't think it's purely the number of lines, since numeric_column_anomalies_training.csv needs to be split into files with fewer lines than dimension_anomalies_training.csv does. So it's probably something to do with the underlying number of bytes instead.

@dbeatty10 dbeatty10 removed their assignment Mar 1, 2023
@dbeatty10
Contributor

A workaround you can use:

  • set the max batch size for seeds within your project

macros/get_batch_size.sql

{% macro get_batch_size() %}
  {{ return(1000) }}
{% endmacro %}
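For context: get_batch_size() is the macro that dbt-core's default seed materialization uses when chunking seed rows into INSERT statements, so overriding it in your project (as above) caps how many rows get bound into any single statement. The arithmetic behind why a smaller batch helps is sketched further down the thread.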

@Fleid
Contributor

Fleid commented Mar 1, 2023

@dbeatty10, do you think that it's just the SQL statement that's too long?
Looking at the log:

21:39:05.373543 [debug] [Thread-1 (]: Redshift adapter: Error running SQL: insert into "ci"."DBT_fleid_2243"."dimension_anomalies_training" ("updated_at", "platform", "version", "user_id") values
(%s,%s,%s,%s),(%s,%s,%s,%s),(%s,%s,%s,%s)...
...
(%s,%s,%s,%s),(%s,%s,%s,%s),(%s,%s,%s,%s)
21:39:05.373706 [debug] [Thread-1 (]: Redshift adapter: Rolling back transaction.
21:39:05.373798 [debug] [Thread-1 (]: On seed.ct_2243.dimension_anomalies_training: ROLLBACK
21:39:05.554898 [debug] [Thread-1 (]: Timing info for seed.ct_2243.dimension_anomalies_training (execute): 2023-03-01 21:38:57.320592 => 2023-03-01 21:39:05.554739
21:39:05.555464 [debug] [Thread-1 (]: On seed.ct_2243.dimension_anomalies_training: Close
21:39:05.586023 [debug] [Thread-1 (]: Runtime Error in seed dimension_anomalies_training (seeds/dimension_anomalies_training.csv)
  'h' format requires -32768 <= number <= 32767

That last error, 'h' format requires -32768 <= number <= 32767, still feels like a data value issue to me. Could it be a specific row that's fishy towards the end of the file? Nah, I tried replacing values, and the same error occurs. So that smells like something within the new Redshift connector library.

@sathiish-kumar does that ring a bell to you at all?

@Fleid Fleid self-assigned this Mar 1, 2023
@sathiish-kumar
Contributor

@Fleid I'm confirming with my colleague who manages the Python driver whether she has seen this before. Without knowing too much about what the exact query was, one guess is that the data type being used is incorrect (they appear to be using a string instead of a number type?). Apologies if this is already a foregone conclusion.

@elongl
Author

elongl commented Mar 1, 2023

You can take inspiration from how we insert rows if you're interested.

@dbeatty10
Contributor

@Fleid yes, I think the statement is too "long", and it just depends on what "too long" actually means 🤔

This thread has an interesting comment:

Just discovered another limit ... you can only insert up to 32,768 (short/2byte) parameters into a SQL statement. Something you can definitely hit before 16MB, if you have small columns.
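
To make that concrete: a multi-row INSERT binds roughly rows × columns parameters, so the largest safe batch depends on how wide the seed is. A minimal sketch of that arithmetic in Python (assuming the 32,767-parameter ceiling quoted above; this is not dbt or redshift-connector code):

# Estimate the largest seed batch that stays under the per-statement
# bind-parameter ceiling (assumed to be 32,767, per the quote above).
MAX_BIND_PARAMS = 32_767

def max_rows_per_batch(num_columns: int) -> int:
    # Each inserted row binds one parameter per column, so
    # rows * columns must stay at or below the ceiling.
    return MAX_BIND_PARAMS // num_columns

print(max_rows_per_batch(4))   # 8191 -> a 4-column seed overflows above ~8k rows per statement
print(max_rows_per_batch(40))  # 819  -> wider seeds overflow much sooner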

@Fleid
Contributor

Fleid commented Mar 2, 2023

@sathiish-kumar it really looks like the statement is just too long now, but it wasn't before.

What's surprising to me though, @dbeatty10, is this: was the previous connection library working some magic behind the scenes to split that up? Or did some default option change? That's weird.

@dbeatty10
Contributor

Yeah, I'm not sure how psycopg2 was handling this differently than redshift-connector 🤷

@Fleid
Contributor

Fleid commented Mar 13, 2023

After discussion with the Redshift team, there isn't anything that can be done at the library level. If we want to restore the behavior, we will need to add new functionality in dbt-redshift.

Short term, there is nothing to be done except:

  • staying on the older library (dbt-redshift 1.4)
  • splitting the seeds into files of 5 to 10K rows each and union-ing them post-ingestion via a model

Medium term, I was thinking of closing this issue as won't-fix and opening a new issue asking for help with an enhancement. But I'd rather just re-tag this one if nobody minds.

@Fleid Fleid added type:enhancement New feature or request help_wanted Extra attention is needed and removed type:bug Something isn't working labels Mar 13, 2023
@Fleid Fleid changed the title [CT-2243] [Regression] dbt seed doesn't work on certain cases. [CT-2243] [Regression] dbt seed doesn't work on certain cases (large row counts) Mar 13, 2023
@elongl
Author

elongl commented Mar 14, 2023

@Fleid Care to elaborate a bit on why this is a won't-fix?
I think it's a pretty basic feature that dbt offers, and it's not going to work on one of the most popular adapters.

@dbeatty10
Contributor

@Fleid converted this from a bug into an enhancement / feature request, and also labeled it as help_wanted in case a motivated community member wants to take it on. He did these conversions instead of closing this issue as wontfix and opening a brand new one.

I didn't deeply consider how you insert rows @elongl, but maybe something similar might be a good path forward here.

@Fleid
Contributor

Fleid commented Mar 14, 2023

Thanks @dbeatty10 :)

@elongl this is an imperfect but honest take on the topic:

  • Seeds are not built to handle "large" volumes of data. The fact that large row counts could be loaded via dbt-redshift in the first place is, to me, a nice side effect more than a core capability. This is my personal opinion, and I recognize that it's a bit spicy
  • We want the new library for connecting to Redshift much more than we want to avoid breaking large-row-count seeds. The connector unlocks a lot of scenarios, and using it removes long-term tech debt. Plus there are workarounds for large seeds
  • Restoring the lost capability requires development time, which is weighed against all the other work that needs to happen in dbt-redshift and the other adapters we support. It's low priority against the rest of that work. The mechanism for us, then, is to tag the work as help_wanted and hope a contributor will be able to help

I recognize that this is not a satisfactory answer. Something worked in 1.4, and we know it won't in 1.5. I'm sorry we can't do better.

@misteliy

I comprehend the intention to prevent the misuse of dbt seed. However, if a feature was functioning correctly in version 1.4 but suddenly stops working, it is undoubtedly a bug. Consequently, the response from the redshift team is not truly satisfactory.

@dbeatty10
Contributor

@misteliy we totally understand your perspective on this.

Regardless of how this is labeled (bug vs. enhancement), these three options are the best we can offer at this time:

  1. workaround using get_batch_size
  2. label this issue as help_wanted with the hope that an open-source contributor will raise a pull request that we can review
  3. continue using dbt-redshift 1.4.x until one of the above works

@jaklan

jaklan commented May 19, 2023

@dbeatty10 @Fleid even if we agree that loading "large" files is not the right use-case for dbt seed, we just can't say "it's not a bug, it's a feature" and move on, for plenty of reasons:

  1. This issue breaks existing dbt projects without any prior notice.
  2. Such a decision should be taken at the dbt-core level, not dbt-redshift, be documented as a breaking change, and be consistent across the various adapters.
  3. Proper logic should be handled on the dbt side (e.g. rejecting seeds above some file size), not by some random error with no obvious meaning.
  4. The "right usage" boundaries should be clearly defined; right now we don't even know what a "large" file means, because of the above point.

Personally, I disagree with labelling such significant regressions with the help_wanted label and just shifting the responsibility onto the community, because such an attitude lowers trust in the project's stability and maturity. dbt-core is of course an open-source library, but dbt Labs itself is not a non-profit company, and this way you also affect your actual customers. I strongly believe any bugs that break existing projects should be treated with high priority by the dbt team.

Having said that, of course we are going to play with the workaround, but I really wanted to highlight that it shouldn't be the way to go.

By the way, what does help_wanted actually mean at the moment? Would you accept a fix, or would you rather accept making this bug a proper feature? It's not really clear what your direction is here.

@misteliy

misteliy commented May 20, 2023

I would highly encourage the AWS team to look into psycopg2 and how it magically solved the problem of "large" seeds. To me this is clearly a driver issue, because it worked with the previous version. So maybe you can share the technical response from the AWS team detailing how psycopg2 solves it and why it's not possible with the AWS-native Redshift driver. Maybe it would be enough to have a parameter for the default batch size (like the macro workaround you shared) as part of the AWS library?

@Fleid
Contributor

Fleid commented May 24, 2023

@jaklan your points are entirely valid. I 100% agree that the proper way to address this issue is by keeping it as a bug, and either fixing it or handling the new behavior in a more elegant way (like the ones you described). I had hoped that my initial comment reflected that point of view, but now I'm realizing I may not have been clear on that.

But I need to be realistic. We don't have the capacity to do that at the moment. I'd rather flag it to the community as an area where we need help than let it rot in my backlog, because that is what would happen right now. It's a hard decision, but between seed size and any of the other open regressions (% being rejected where it shouldn't, timeouts being shortened to uselessness, transactions not being committed, or SSL support being dropped...), I'm prioritizing the latter. That is because either the impact is wider/deeper or there is no workaround yet.

When the rest of these regressions are handled, and if this issue is not resolved by the community in the meantime, we can certainly re-visit this decision. I'm sorry I don't have a better answer here, and I hope you understand.

@dbeatty10
Contributor

Workaround

We lowered the default batch size for seeds to try to help mitigate this issue (#468).

Since the workaround doesn't address the root cause, we are leaving this issue open in the meantime.

Root cause

redshift_connector is a descendant of pg8000, and the root cause is discussed in tlocke/pg8000#120 (comment).
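
For what it's worth, the error string itself comes from Python's struct module: the wire protocol encodes the bind-parameter count as a signed 16-bit ('h') value, so packing a count above 32,767 raises exactly the message seen in the failed seeds. A minimal reproduction (not the driver's actual code path, just the same struct call):

import struct

# Packing a bind-parameter count above 32,767 as a signed 16-bit value
# reproduces the error message from the failed seed runs.
num_params = 4 * 10_000  # e.g. a 4-column seed sent as one 10,000-row batch

try:
    struct.pack("!h", num_params)
except struct.error as exc:
    print(exc)  # 'h' format requires -32768 <= number <= 32767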

@dbeatty10
Contributor

A solution may involve implementing setinputsizes.

pg8000's implementation could be used as an inspiration.

@dataders
Contributor

dataders commented Jun 6, 2023

just opened aws/amazon-redshift-python-driver#165 to see if we can have the issue resolved upstream. thanks everyone for your help getting to the bottom of this!
