[CT-2243] [Regression] dbt seed doesn't work on certain cases (large row counts) #347
Thanks for discovering and reporting this @elongl 🏆 I was able to confirm this as a regression introduced in dbt-redshift 1.5.0b1. My guess is that it has something to do with us using a new database driver; we migrated from psycopg2 to redshift_connector. It appears to be something related to the length of the seed file rather than the content itself. I don't think it's purely the number of lines, since numeric_column_anomalies_training.csv needs to be split into files with fewer lines than dimension_anomalies_training.csv. So it's probably something to do with an underlying number of bytes instead.
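The "bytes, not lines" observation can be illustrated with a small sketch (hypothetical sizing logic, not dbt's actual code): the length of a multi-row INSERT depends on the rendered width of every row, so a file with fewer but wider rows can still produce a longer statement.

```python
# Hypothetical sizing sketch (not dbt's actual code): estimate how many
# bytes a single multi-row INSERT would occupy for a given set of rows.
def estimate_insert_bytes(rows):
    """Rough length of `insert into my_schema.my_seed values (...), ...`."""
    header = len("insert into my_schema.my_seed values ")
    body = sum(
        len("(" + ", ".join(repr(v) for v in row) + "), ")
        for row in rows
    )
    return header + body

wide_rows = [("x" * 100, 1.2345)] * 500    # fewer rows, long string values
narrow_rows = [("x", 1)] * 5000            # more rows, short values

# The file with fewer (but wider) rows renders to a larger statement:
assert estimate_insert_bytes(wide_rows) > estimate_insert_bytes(narrow_rows)
```

This is consistent with the two training CSVs behaving differently at different line counts: the statement size, not the row count, is what matters.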
A workaround you can use:

```jinja
{% macro get_batch_size() %}
  {{ return(1000) }}
{% endmacro %}
```
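For context, lowering the batch size simply splits the seed's rows across more, smaller INSERT statements. A minimal sketch of that chunking (illustrative only, not dbt's implementation):

```python
# Illustrative chunking: a batch size of 1000 turns 2500 rows into
# three INSERT batches instead of one oversized statement.
def chunk(rows, batch_size):
    """Yield successive slices of at most `batch_size` rows."""
    for i in range(0, len(rows), batch_size):
        yield rows[i:i + batch_size]

rows = list(range(2500))
batches = list(chunk(rows, 1000))
print([len(b) for b in batches])  # [1000, 1000, 500]
```

Smaller batches keep each generated statement under whatever length limit the driver or server imposes, at the cost of more round trips.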
@dbeatty10, do you think that it's just the SQL statement that's too long?
@sathiish-kumar does that ring a bell to you at all?
@Fleid confirming with my colleague who manages the Python driver on whether she has seen this before. Without knowing too much about what the exact query was, one guess here is that the data type being used is incorrect (they appear to be using a string instead of a number type?). Apologies if this is an already foregone conclusion.
You can take inspiration from how we insert rows if you're interested.
@Fleid yes, I think the statement is too "long", and it just depends on what too "long" actually means 🤔 This thread has an interesting comment:
@sathiish-kumar it really looks like the statement is just too long now, but it wasn't before. That's what's surprising to me though @dbeatty10: was the previous connection library working some magic behind the scenes to split that up? Or did some default option change? That's weird.
Yeah, I'm not sure how psycopg2 was handling this differently than redshift_connector does.
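One driver-side pattern that keeps every statement small is the DB-API `executemany` with parameter placeholders, sent in bounded batches. A hedged, runnable sketch (using sqlite3 as a stand-in for a Redshift connection; the table name and batch size are illustrative, not from this thread):

```python
# Hypothetical sketch of batched, parameterized inserts via the DB-API
# `executemany` pattern. sqlite3 stands in for a Redshift connection.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table seed_data (id integer, name text)")

rows = [(i, f"name_{i}") for i in range(5000)]

BATCH = 1000  # illustrative bound on rows per round trip
for start in range(0, len(rows), BATCH):
    conn.executemany(
        "insert into seed_data values (?, ?)",
        rows[start:start + BATCH],
    )
conn.commit()

count = conn.execute("select count(*) from seed_data").fetchone()[0]
print(count)  # 5000
```

With placeholders, the statement text itself stays constant-size; only the bound parameters grow, which sidesteps any limit on statement length.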
After discussion with the Redshift team, there isn't anything that can be done at the library level. If we want to restore the behavior, we will need to add new functionality in dbt-redshift. Short term, there is nothing to be done except:
Medium term, I was thinking of closing this issue as wontfix.
changed the title: dbt seed doesn't work on certain cases → dbt seed doesn't work on certain cases (large row counts)
@Fleid Care to elaborate a bit about why this is a no-fix?
@Fleid converted this from a bug to an enhancement.

I didn't deeply consider how you insert rows @elongl, but maybe something similar might be a good path forward here.
Thanks @dbeatty10 :) @elongl this is an imperfect but honest take on the topic:
I recognize that this is not a satisfactory answer. Something worked in 1.4, and we know it won't in 1.5. I'm sorry we can't do better.
I understand the intention to prevent the misuse of dbt seed. However, if a feature was functioning correctly in version 1.4 but suddenly stops working, it is undoubtedly a bug. Consequently, the response from the Redshift team is not truly satisfactory.
@misteliy we totally understand your perspective on this. Regardless of how this is labeled (bug vs. enhancement), these three options are the best we can offer at this time:
@dbeatty10 @Fleid even if we agree that loading "large" files is not a right use-case for dbt seed, personally I disagree with labelling such significant regressions as wontfix. Having said that, of course we are going to play with a workaround, but I really wanted to highlight it shouldn't be the way to go.

I would highly encourage the AWS team to look into it.
@jaklan your points are entirely valid. I 100% agree that the proper way to address this issue is by keeping it as a bug, and either fixing it or handling the new behavior in a more elegant way (like the ones you described). I had hoped that my initial comment reflected that point of view, but now I'm realizing I may not have been clear on that.

But I need to be realistic. We don't have the capacity to do that at the moment. I'd rather flag it to the community as an area where we need help than let it rot in my backlog, because that is what would happen right now. It's a hard decision to make between seed size and any of the other open regressions.

When the rest of these regressions are handled, and if this issue is not resolved by the community in the meantime, we can certainly re-visit this decision. I'm sorry I don't have a better answer here, and I hope you understand.
Workaround: We lowered the default batch size for seeds to try to help mitigate this issue (#468). Since the workaround doesn't address the root cause, we are leaving this issue open in the meantime.

Root cause: A solution may involve a new batching implementation; pg8000's implementation could be used as an inspiration.
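One way such batching could look (a hypothetical sketch, not the dbt-redshift or pg8000 code): split rows into batches so that no single rendered INSERT exceeds a byte budget, rather than fixing only the number of rows per batch.

```python
# Hypothetical byte-budgeted batching; names and limits are illustrative.
def render_values(row):
    """Render one row as a SQL values tuple, e.g. ('abc', 1)."""
    return "(" + ", ".join(repr(v) for v in row) + ")"

def batches_by_bytes(rows, max_bytes):
    """Yield batches whose rendered VALUES lists stay under max_bytes."""
    batch, size = [], 0
    for row in rows:
        rendered = len(render_values(row)) + 2  # + ", " separator
        if batch and size + rendered > max_bytes:
            yield batch
            batch, size = [], 0
        batch.append(row)
        size += rendered
    if batch:
        yield batch

rows = [("x" * 50, i) for i in range(100)]
sizes = [len(b) for b in batches_by_bytes(rows, max_bytes=1000)]
assert sum(sizes) == 100  # every row lands in exactly one batch
```

Bounding bytes instead of rows would handle both the "many short rows" and "few wide rows" seed files with a single limit.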
@dbeatty10 so it is going back to AWS (https://github.com/aws/amazon-redshift-python-driver/blob/910be71314229e5b2febc60ddf8b8bcc992ea5f0/redshift_connector/cursor.py#L471-L476) after all? Is the team aware and working on it?
Just opened aws/amazon-redshift-python-driver#165 to see if we can have the issue resolved upstream. Thanks everyone for your help getting to the bottom of this!
Is this a regression in a recent version of dbt-redshift?
Current Behavior
I'm not 100% sure when it happens, but I'm unable to seed CSVs that I was previously able to seed.
I'm getting the following error:
Expected/Previous Behavior
I should be able to seed them.
Steps To Reproduce
1. Put the CSV file in the seeds directory.
2. Run dbt seed.

Here's another file if needed.
Relevant log output
No response
Environment
Additional Context
No response