
[Bug] [1.4.0a] Seeds aren't working when partial parsing #880

Closed
dwreeves opened this issue Mar 6, 2024 · 12 comments
Comments


dwreeves commented Mar 6, 2024

Not sure what's going on but after switching to 1.4.0a, my seeds stopped working.

[2024-03-05, 00:47:29 UTC] {log.py:232} WARNING - [2024-03-05T00:47:29.333+0000] {subprocess.py:94} INFO - (astronomer-cosmos) - [Errno 2] No such file or directory: 'dags/dbt/seeds/us_states_seed.csv'

Really weird.

There is a hint of what is going on. My dbt directory is dags/dbt, and relative to that it should be looking for seeds/us_states_seed.csv. However, it seems to literally be looking for dags/dbt/seeds/us_states_seed.csv inside the dags/dbt folder, meaning it's looking for dags/dbt/dags/dbt/seeds/us_states_seed.csv.
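To illustrate the doubling (a minimal sketch using the paths from the log, not Cosmos's actual code): dbt records seed paths relative to the directory where parsing originally ran, so resolving them again from inside the project directory doubles the prefix.

```python
# Minimal sketch of the path doubling described above (not Cosmos's code).
import os

recorded_path = "dags/dbt/seeds/us_states_seed.csv"  # path saved at parse time
runtime_cwd = "dags/dbt"                             # directory dbt runs from

resolved = os.path.join(runtime_cwd, recorded_path)
print(resolved)  # dags/dbt/dags/dbt/seeds/us_states_seed.csv
```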

I don't have time to look into this for the next few days; I'll dig into what's going on this weekend.

dbt-core version: dbt-core==1.6.4
Astronomer Cosmos version: astronomer-cosmos==1.4.0a1
Python: Python 3.11.6


dosubot bot commented Mar 6, 2024

Based on the error message you provided, it seems like the system is unable to locate the us_states_seed.csv file in your dbt project. This could be due to a few reasons:

  1. Incorrect Seed File Location: The system is looking for the us_states_seed.csv file in the dags/dbt/seeds directory. Please verify that the file is located in this directory. If your seeds directory is located elsewhere or named differently, you'll need to adjust the structure or the operator's project_dir accordingly.

  2. Environment Variable Misconfiguration: If you're using the DBT_ROOT_PATH environment variable to set your dbt project's root path, ensure that it's correctly set in your environment and points to the directory that contains your dbt project. You can debug this by adding a print statement before the DbtSeedOperator instantiation to log the resolved DBT_ROOT_PATH.

  3. File Access Permissions: Ensure that the file permissions for us_states_seed.csv and its parent directories allow the user running the Airflow worker to read the file. Incorrect permissions can lead to a file not being accessible, resulting in a 'No such file or directory' error.

  4. Incorrect File Naming and References: Double-check the naming of your seed file and its reference in your dbt project's seed configuration. A typo or mismatch in the file name or path could cause dbt to look for the file in the wrong location.

If after these checks the issue persists, consider manually specifying the full path to the seed file in your environment or script to rule out path resolution issues, and ensure that your dbt project's structure is correctly set up to include the seeds directory in the expected location relative to the project_dir.

Regarding how Cosmos handles file path resolution, it creates symbolic links (symlinks) to the dbt project files and copies the .msgpack file used for partial parsing. This is done to ensure that operations within a temporary directory reflect the structure and necessary files of the original dbt project directory, without directly modifying the original files. You can find more details about this in the cosmos/dbt/project.py file.

I hope this helps! If you have any further questions or if the issue persists, please provide more details about your dbt project structure and how you're setting up and running your dbt and Airflow tasks.


@dosubot dosubot bot added area:dependencies Related to dependencies, like Python packages, library versions, etc dbt:seed Primarily related to dbt seed command or functionality execution:local Related to Local execution environment parsing:custom Related to custom parsing, like custom DAG parsing, custom DBT parsing, etc labels Mar 6, 2024
@tatiana tatiana added bug Something isn't working priority:high High priority issues are blocking or critical issues without a workaround and large impact labels Mar 6, 2024

tatiana commented Mar 6, 2024

@dwreeves, thanks for pointing this out. Could you confirm which configurations you're using to run this? I'll reproduce and try to get it fixed by tomorrow.


dwreeves commented Mar 6, 2024

@tatiana Oh my god I figured it out and it's kind of funny and unfortunate.

Basically, the partial parse saves the location of the seed file. So if you use the partial parse file, but run in a different directory (aka tmp/something), it checks for the seed file at the location it saw it at when the partial parsing occurred.

There are a few potential avenues to solve the issue.

  • I think changing the `cwd=` of the subprocess to the Airflow home directory might work. If so, it gives you the best of both worlds: you get partial parsing when running seeds, and the seeds are still found just fine. This would be my recommended solution, along with passing --project-dir explicitly. I don't know if it has any potential downsides; for typical users it shouldn't, but who knows what people are doing. For safety, I'd suggest adding an operator kwarg like set_cwd_to_tmp_dir: bool, so users who experience issues can revert to the previous behavior.
  • If that doesn't work, another option is turning off partial parsing for seed and build.
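A sketch of how the first option could shape the subprocess call (the helper name, the `set_cwd_to_tmp_dir` kwarg, and the paths are illustrative, not Cosmos's actual API):

```python
# Illustrative only: run dbt from a stable cwd while pointing --project-dir
# at the temp copy, with an opt-out back to the old behavior.
import os
from typing import List, Tuple

def build_seed_invocation(
    tmp_project_dir: str, set_cwd_to_tmp_dir: bool = False
) -> Tuple[List[str], str]:
    """Return (argv, cwd) for a `dbt seed` call against the temp project copy."""
    argv = ["dbt", "seed", "--project-dir", tmp_project_dir]
    # Keep a stable cwd so paths recorded at parse time still resolve,
    # unless the user opts back into running from the temp dir.
    cwd = (
        tmp_project_dir
        if set_cwd_to_tmp_dir
        else os.environ.get("AIRFLOW_HOME", os.getcwd())
    )
    return argv, cwd
```

The returned pair would then feed something like `subprocess.Popen(argv, cwd=cwd)`.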


dwreeves commented Mar 6, 2024

Actually let me think for a second about what the solution should be for the API. Setting the cwd as the AIRFLOW_HOME would be bad for users who run "cd dags/dbt && dbt compile", and I imagine some users would. A more generic "dbt_working_dir" kwarg could make more sense.


dwreeves commented Mar 6, 2024

In more positive news... this does confirm that I am successfully using the partial_parse.msgpack in Cosmos 1.4.0a 😉 Although it's a distressing way to figure out that it worked.

@dwreeves dwreeves changed the title [Bug] [1.4.0a] Seeds aren't working [Bug] [1.4.0a] Seeds aren't working when partial parsing Mar 6, 2024

tatiana commented Mar 7, 2024

@dwreeves would it make more sense for us to solve the problem when we generate the partial_parse.msgpack?
How did you generate it?


dwreeves commented Mar 8, 2024

@dwreeves would it make more sense for us to solve the problem when we generate the partial_parse.msgpack?
How did you generate it?

Sigh, it's harder than I was hoping it would be.

Here are the instructions I have on how to get partial parsing working in 1.4.0 (it's a draft snippet from a blog I'm writing on this and a few other Cosmos topics):

For now, you must follow these directions exactly or else it will not work! Sorry.

  1. Go to one of your DbtOperator task logs in Airflow, and copy the profile YAML you see in the logs.
    It will look something like this (example of a Snowflake profile):

    my_profile_name:
        outputs:
            airflow_target:
                account: myaccount
                database: mydatabase
                password: '{{ env_var(''COSMOS_CONN_SNOWFLAKE_PASSWORD'') }}'
                # ... etc. ...
        target: my_target_name
  2. Add this file to your Airflow project somewhere, e.g. dags/dbt/profile/profiles.yml.

  3. In your CICD, set the DBT_PROFILES_DIR env var to the folder where you placed that file, e.g. dags/dbt/profile.

  4. Add --profile [my_profile_name] --target [my_target_name] to the dbt parse command you run in your CICD (replace [my_profile_name] and [my_target_name] with the appropriate names). You must add these flags and you cannot just rely on defaults. This is because dbt partial parsing does a checksum on a subset of the flags used in the CLI, and both --profile and --target are included in the checksum. Failing to add these flags means it won't work.

    dbt deps
    dbt parse --profile [my_profile_name] --target [my_target_name]
  5. You must set up your CICD to use the exact same connection you use in Cosmos, meaning you need to set COSMOS_CONN_SNOWFLAKE_PASSWORD to the password you use in production. Again, similar deal as above: dbt does checksums when parsing, and every call to the env_var() macro has a side effect where the env var's key-value pair is eventually put into the hash.

  6. Any --vars and env vars you use must also be included in the exact same way you use them in Airflow. You cannot rely on DagRun-specific or DAG-specific parameterized vars / env vars, otherwise partial parsing won't work. (The 1.5.0 or later updates to partial parsing support should address this limitation, though.)
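Putting the steps above together, a CI job might be wired up like this. All names are the placeholders from the steps (my_profile_name, my_target_name, dags/dbt/profile), and the source of the secret is an assumption:

```python
# Consolidated sketch of steps 3-5 above; profile/target names and the
# SNOWFLAKE_PASSWORD source are placeholders, not a definitive setup.
import os
import subprocess
from typing import Dict, List

def ci_env() -> Dict[str, str]:
    """Steps 3 and 5: profiles dir plus the same connection secret Cosmos uses."""
    return {
        **os.environ,
        "DBT_PROFILES_DIR": "dags/dbt/profile",
        # Must match the production value, since env_var() calls feed
        # the partial-parse checksum.
        "COSMOS_CONN_SNOWFLAKE_PASSWORD": os.environ.get("SNOWFLAKE_PASSWORD", ""),
    }

def parse_commands() -> List[List[str]]:
    """Step 4: --profile/--target must be passed explicitly (checksummed)."""
    return [
        ["dbt", "deps"],
        ["dbt", "parse", "--profile", "my_profile_name", "--target", "my_target_name"],
    ]

if __name__ == "__main__":
    for cmd in parse_commands():
        subprocess.run(cmd, env=ci_env(), check=True)
```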

If you do not follow these steps exactly, you will see something like this in your logs:

Unable to do partial parsing because config vars, config profile, or config target have changed
Unable to do partial parsing because profile has changed
Unable to do partial parsing because env vars used in profiles.yml have changed

Or if something went really wrong, you'll see this:

Unable to do partial parsing because saved manifest not found. Starting full parse.

If you do not see any of these messages in your logs, congrats, you have successfully made partial parsing work! 😊😊😊


dwreeves commented Mar 8, 2024

Can we release 1.4.0a2? I want to see if --project-dir [tmpdir] works to solve this issue. (Aka what I pushed in #873). I have a feeling it might.

I would use the node_converters feature to test this change early, but I find that feature terribly cumbersome and hard to use and I'm struggling to get it to work, I'm sorry to say. It would be a lot easier to just do a second alpha release.


tatiana commented May 17, 2024

@dwreeves When you have a chance, please, could you confirm if this issue was solved in 1.4.0?

@tatiana tatiana added this to the 1.5.0 milestone May 17, 2024
@tatiana tatiana added triage-needed Items need to be reviewed / assigned to milestone and removed triage-needed Items need to be reviewed / assigned to milestone labels May 17, 2024
@tatiana tatiana mentioned this issue May 17, 2024

tatiana commented Jun 6, 2024

Hi @dwreeves, did you have a chance to check if this still happens in 1.4.1?


dwreeves commented Jun 6, 2024

Not yet. I believe adding --project-dir [tmpdir] should have fixed it, but I haven't fully confirmed.

@dwreeves

Yep, it's good, the fix worked. Closing as completed.
