GCSToBigQueryOperator does not respect the destination project ID #29958
Apache Airflow Provider(s)
google
Versions of Apache Airflow Providers
apache-airflow-providers-google==8.10.0
Apache Airflow version
2.3.4
Operating System
Ubuntu 18.04.6 LTS
Deployment
Google Cloud Composer
Deployment details
Google Cloud Composer 2.1.2
What happened
GCSToBigQueryOperator does not respect the BigQuery project ID specified in the destination_project_dataset_table argument. Instead, it prioritizes the project ID defined in the Airflow connection.

What you think should happen instead

The project ID specified via destination_project_dataset_table should be respected.

Use case: Suppose our Composer environment and service account (SA) live in project-A, and we want to transfer data into foreign projects B, C, and D. We don't have credentials (and thus don't have Airflow connections defined) for projects B, C, and D. Instead, all transfers are executed by our singular SA in project-A. (Assume this SA has cross-project IAM policies.) Thus, we want to use a single SA and single Airflow connection (i.e. gcp_conn_id=google_cloud_default) to send data into 3+ destination projects. I imagine this is a fairly common setup for sending data across GCP projects.

Root cause: I've been studying the source code, and I believe the bug is caused by line 309. Experimentally, I have verified that hook.project_id traces back to the Airflow connection's project ID. If no destination project ID is explicitly specified, then it makes sense to fall back on the connection's project. However, if the destination project is explicitly provided, surely the operator should honor that. I think this bug can be fixed by amending line 309 along the lines of the sketch below. This pattern is used successfully in many other areas of the repo (example).
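To make the idea concrete, here is a sketch, not a verified patch: the variable names are assumed from the surrounding operator code, and the hook's split_tablename helper (which already implements parse-then-fallback semantics) is used to honor an explicit destination project.

```python
# Sketch only -- approximates the code around line 309; names are assumptions.
# split_tablename parses an explicit project out of the destination string,
# falling back to default_project_id only when none is present:
#   "project-B.dataset.table" -> ("project-B", "dataset", "table")
#   "dataset.table"           -> (hook.project_id, "dataset", "table")
project_id, dataset_id, table_id = hook.split_tablename(
    table_input=self.destination_project_dataset_table,
    default_project_id=hook.project_id,
)

# The load job would then be submitted against the parsed project,
# rather than unconditionally against hook.project_id:
job = hook.insert_job(
    configuration=self.configuration,
    project_id=project_id,  # was: hook.project_id
    location=self.location,
    job_id=job_id,
)
```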
How to reproduce
Admittedly, this bug is difficult to reproduce, because it requires two GCP projects, i.e. a service account in project-A, and inbound GCS files and a destination BigQuery table in project-B. Also, you need an Airflow server with a google_cloud_default connection that points to project-A like this. Assuming all that exists, the bug can be reproduced via an Airflow DAG like the sketch below.
Stack trace:

From the stack trace, notice the operator is (incorrectly) attempting to insert into project-A rather than project-B.

Anything else
Perhaps out of scope, but the inverse direction suffers from this same problem, i.e. BigQueryToGcsOperator and line 192 (see the sketch below).
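For completeness, a hypothetical sketch of the same fallback pattern applied there (names assumed, unverified):

```python
# Hypothetical sketch for the export direction (around line 192 of
# BigQueryToGcsOperator): honor an explicit source project before
# falling back to the connection's project.
project_id, dataset_id, table_id = hook.split_tablename(
    table_input=self.source_project_dataset_table,
    default_project_id=hook.project_id,  # fallback only
)
```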
Are you willing to submit PR?
Code of Conduct
Comments

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise a PR to address this issue, please do so; there is no need to wait for approval.

Thank you @chriscugliotta for documenting this very annoying bug.

Please assign to me.

Assigned.

I'm currently working on it. Should I write something here, or just make a PR after completing?

@Yaro1 you could raise a PR and put the text in the description.

Okay, got it, thanks.

Thank you, @Yaro1!

My pleasure :)