Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GCSToBQ operator does not respect project_id in deferrable mode with impersonation chain. #32093

Closed
2 tasks done
nathadfield opened this issue Jun 23, 2023 · 12 comments
Closed
2 tasks done
Assignees
Labels
area:providers kind:bug This is a clearly a bug provider:google Google (including GCP) related issues

Comments

@nathadfield
Copy link
Collaborator

Apache Airflow version

2.6.2

What happened

When using the GCSToBigQueryOperator in deferrable mode with an impersonation_chain service account which has a default project_id that is different from the project_id specified in the operator arguments, a failure occurs.

[2023-06-23, 11:38:37 UTC] {taskinstance.py:1824} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/airflow/providers/google/cloud/transfers/gcs_to_bigquery.py", line 447, in execute_complete
    raise AirflowException(event["message"])
airflow.exceptions.AirflowException: 404, message='Not Found: {\n  "error": {\n    "code": 404,\n    "message": "Not found: Job king-cdmr-etl-sandbox:airflow_apptweak_king_itunes_connect_channels_load_active_devices_to_bq_2023_06_22T07_00_00_00_00_4842808969d21632ecbb76ffca48aabd",\n    "errors": [\n      {\n        "message": "Not found: Job king-cdmr-etl-sandbox:airflow_apptweak_king_itunes_connect_channels_load_active_devices_to_bq_2023_06_22T07_00_00_00_00_4842808969d21632ecbb76ffca48aabd",\n        "domain": "global",\n        "reason": "notFound"\n      }\n    ],\n    "status": "NOT_FOUND"\n  }\n}\n', url=URL('https://www.googleapis.com/bigquery/v2/projects/king-cdmr-etl-sandbox/jobs/airflow_apptweak_king_itunes_connect_channels_load_active_devices_to_bq_2023_06_22T07_00_00_00_00_4842808969d21632ecbb76ffca48aabd')

I believe this happens because, although the BigQuery job to insert data, is raised against self.project_id in _submit_job, when in deferrable mode it tries to find the job within the project in self.hook.project_id.

It is possible that that the default project_id assigned to the impersonation chain service account is different to the project_id specified to the operator.

In the above error, you can see that the error says that it cannot find the job_id airflow_apptweak_king_itunes_connect_channels_load_active_devices_to_bq_2023_06_22T07_00_00_00_00_4842808969d21632ecbb76ffca48aabd in the project king-cdmt-etl-sandbox.

In fact this job_id was created successfully in the project king-coredatasets-sandbox

Screenshot 2023-06-23 at 12 40 39

What you think should happen instead

I think that we should modify the call to self.defer to receive self.project_id rather than self.hook.project_id

How to reproduce

I haven't quite got the exact steps to reproduce but I will submit a PR for review soon.

Operating System

Debian GNU/Linux 11 (bullseye)

Versions of Apache Airflow Providers

apache-airflow-providers-google==10.0.0

Deployment

Astronomer

Deployment details

No response

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@nathadfield nathadfield added area:core kind:bug This is a clearly a bug needs-triage label for new issues that we didn't triage yet labels Jun 23, 2023
@hussein-awala hussein-awala added provider:google Google (including GCP) related issues area:providers and removed area:core needs-triage label for new issues that we didn't triage yet labels Jun 23, 2023
@bhagany
Copy link
Contributor

bhagany commented Jun 23, 2023

Thank you for submitting this -- I have been procrastinating filing an issue of my own about this inconsistent handling of project ids. I believe the problems with changes to GCSToBigQueryOperator and BigQueryToGCSOperator extend further than the case you've raised, and are caused by these changes. Would you be willing to talk about the larger set of issues as well? I see the PR you submitted and I think your change is a good one, but I think more are needed.

Somewhat inconveniently, I am only a few hours away from a 2-week vacation, so my responses may be delayed

@bhagany
Copy link
Contributor

bhagany commented Jun 23, 2023

^ I decided to file one, just to get my thoughts down

@nathadfield
Copy link
Collaborator Author

@bhagany Thanks for you comments and I'd be more than happy to talk more about this but there's no rush while you're away. This PR was really just me patching for a specific issue that we have to work around internally currently so it would be good to deal with these issues more comprehensively.

@nathadfield
Copy link
Collaborator Author

BigQueryGetDataOperator also has this same problem

@eladkal
Copy link
Contributor

eladkal commented Jul 8, 2023

@nathadfield @avinashpandeshwar
I guess #32232 did not cover GCSToBigQueryOperator BigQueryGetDataOperator
can you look into it and raise PR?

@avinashpandeshwar
Copy link
Contributor

@eladkal Fix to GCSToBigQueryOperator for deferrable mode was addressed in #32232, or did you mean for BigQueryGetDataOperator?

@eladkal
Copy link
Contributor

eladkal commented Jul 9, 2023

or did you mean for BigQueryGetDataOperator?

Yes sorry.. I meant to BigQueryGetDataOperator

@avinashpandeshwar
Copy link
Contributor

@eladkal This current issue seems to be solved via #32232 and could be closed. And I couldn't find another issue for BigQueryGetDataOperator specifically to link a PR. Should I create an issue first, or just provide a standalone PR here?

@eladkal
Copy link
Contributor

eladkal commented Jul 10, 2023

@eladkal This current issue seems to be solved via #32232 and could be closed. And I couldn't find another issue for BigQueryGetDataOperator specifically to link a PR. Should I create an issue first, or just provide a standalone PR here?

Just open a PR and link to this issue as related

@avinashpandeshwar
Copy link
Contributor

@nathadfield @bhagany For BigQueryGetDataOperator, seems the project_id param is reserved for table's (storage) project. This conflicts with the convention followed across other operators where project_id stands for compute (job) project. I was wondering a reasonable approach is to have a separate optional table_project_id to provide the storage project (defaults to hook project_id), and use project_id (defaults to hook project_id) to submit the job. Do let me know your thoughts on this.

cc @eladkal

@eladkal
Copy link
Contributor

eladkal commented Jul 10, 2023

I think it's best to raise a PR with what you believe is the best course of action and explain the pros/cons in the PR description. Its easier to have this discussion where we see the code and scope of changes

@eladkal
Copy link
Contributor

eladkal commented Jul 27, 2023

Resolved in #32488 and #32232

@eladkal eladkal closed this as completed Jul 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:providers kind:bug This is a clearly a bug provider:google Google (including GCP) related issues
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants