Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CT-1034] [Bug] Unexpected behavior when chaining methods on dbt-ref'ed/sourced dataframes #5646

Closed
2 tasks done
lostmygithubaccount opened this issue Aug 11, 2022 · 6 comments · Fixed by #5677
Closed
2 tasks done
Labels
bug Something isn't working python_models

Comments

@lostmygithubaccount
Copy link
Contributor

lostmygithubaccount commented Aug 11, 2022

Is this a new bug in dbt-core?

  • I believe this is a new bug in dbt-core
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

Community Slack report here:

https://getdbt.slack.com/archives/C03QUA7DWCW/p1660223963454249

I reproduced easily -- the first run had df = df.limit(100) line uncommented and no chained call on the prior line, the second run is what's shown in the screenshot:

image

Expected Behavior

Would not expect method chaining to cause errors like this on dbt.ref(...) or dbt.source(...). We should clearly document this behavior and give a better error method if there's some underlying reason why we can't easily fix this, otherwise we should fix it.

Steps To Reproduce

See screenshot above. Basically add a .limit(N) to your reference call, e.g. df = dbt.ref(model_name).limit(100).

Relevant log output

@lostmygithubaccount ➜ /workspaces/codyspace/nyc_taxi_with_python (main ✗) $ dbt run -s int_yellow 
17:10:04  Running with dbt=1.3.0-b1
17:10:04  Found 5 models, 0 tests, 0 snapshots, 1 analysis, 571 macros, 0 operations, 0 seed files, 1 source, 0 exposures, 0 metrics
17:10:04  
17:10:05  Concurrency: 8 threads (target='dev')
17:10:05  
17:10:05  1 of 1 START python table model dbt_cody.int_yellow ............................ [RUN]
17:10:10  1 of 1 ERROR creating python table model dbt_cody.int_yellow ................... [ERROR in 4.58s]
17:10:10  
17:10:10  Finished running 1 table model in 0 hours 0 minutes and 5.57 seconds (5.57s).
17:10:10  
17:10:10  Completed with 1 error and 0 warnings:
17:10:10  
17:10:10  Database Error in model int_yellow (models/intermediate/int_yellow.py)
17:10:10    100357 (P0000): Python Interpreter Error:
17:10:10    Traceback (most recent call last):
17:10:10      File "_udf_code.py", line 83, in main
17:10:10      File "_udf_code.py", line 7, in model
17:10:10      File "_udf_code.py", line 53, in <lambda>
17:10:10      File "_udf_code.py", line 21, in ref
17:10:10    KeyError: 'stg_yellow'
17:10:10     in function INT_YELLOW__DBT_SP with handler main
17:10:10    compiled Code at target/run/nyc_taxi_with_python/models/intermediate/int_yellow.py
17:10:10  
17:10:10  Done. PASS=0 WARN=0 ERROR=1 SKIP=0 TOTAL=1

Environment

- OS: linux
- Python: 3.10
- dbt: 1.3 beta, installed via main branches

Which database adapter are you using with dbt?

snowflake

Additional Context

cc: @ChenyuLInx

No response

@lostmygithubaccount lostmygithubaccount added bug Something isn't working triage Team:Execution and removed triage labels Aug 11, 2022
@github-actions github-actions bot changed the title [Bug] Unexpected behavior when chaining methods on dbt-ref'ed/sourced dataframes [CT-1034] [Bug] Unexpected behavior when chaining methods on dbt-ref'ed/sourced dataframes Aug 11, 2022
@ChenyuLInx
Copy link
Contributor

@lostmygithubaccount I looked into this issue a bit and noticed that this is related to how we traverse the parsed AST.
We can go with either options you purposed, through proper message is slightly easier. But properly fix this is totally doable, we can switch from just traverse the AST tree of python models to also use regex to parse refs and sources. Will estimate this in the upcoming BLG and we can decide which way we want to go.

Note to myself: some example regex for dbt.ref

(dbt\.ref\()[^\)\}]*(\))

@jtcohen6
Copy link
Contributor

Without having tested at all locally — I'd guess that the regex approach risks being significantly slower, and potentially liable to abuse, versus using Python's AST. I'd lean slightly in the direction of just documenting that dbt.ref() + dbt.source() cannot be chained with other method calls. It's an extra line of code for the user, not a big usability issue.

@elongl
Copy link

elongl commented Aug 12, 2022

As the one who experienced and reported the issue, I personally think that it's preferable to give a technical solution here.
It's probably going to confuse a lot of users and that they will stumble upon it given that dbt.ref() and dbt.source() are among the most used functions that will be used in Python models.

@ChenyuLInx
Copy link
Contributor

@jtcohen6 thinking again it should be possible to do all of this using the AST NodeVisitor. When I get to it, if it is really tricky to support syntax like dbt.ref(model_name).limit(100) or fancy_transform(dbt.ref(model_name)), I will make sure we raise a clear error.

@lostmygithubaccount
Copy link
Contributor Author

@elongl definitely agree that'd be ideal, but to @jtcohen6's point we should be conscious of the implications. will defer to @ChenyuLInx to further investigate the options and performance/other technical impacts before we make a call.

@elongl appreciate you reporting this! it is something that's going to trip people up and good to catch early in the beta

@iknox-fa
Copy link
Contributor

As part of estimation we're including doing a very basic benchmark of doing a full ast parse (as opposed to the top-level we do currently)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working python_models
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants