
BigQueryGetDataOperator does not respect project_id parameter #30635

Closed
2 tasks done
ying-w opened this issue Apr 14, 2023 · 7 comments · Fixed by #30651
Labels
area:providers good first issue kind:bug This is a clearly a bug provider:google Google (including GCP) related issues

Comments

@ying-w
Contributor

ying-w commented Apr 14, 2023

Apache Airflow Provider(s)

google

Versions of Apache Airflow Providers

apache-airflow-providers-google==8.11.0
google-cloud-bigquery==2.34.4

Apache Airflow version

2.5.2+astro.2

Operating System

OSX

Deployment

Astronomer

Deployment details

No response

What happened

When setting the project_id parameter on BigQueryGetDataOperator, the default project from the environment is not overridden. Maybe something broke after the parameter was added in #25782?

What you think should happen instead

Passing project_id as a parameter should take precedence over reading it from the environment.
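The expected precedence could be sketched with a hypothetical helper (this is not Airflow's actual resolution code, just an illustration of the ordering described above):

```python
import os

def resolve_project_id(param_project_id=None, conn_project_id=None):
    # Hypothetical resolution order: an explicit parameter wins, then the
    # project_id extra from the connection, then the GOOGLE_CLOUD_PROJECT
    # environment variable.
    return (
        param_project_id
        or conn_project_id
        or os.environ.get("GOOGLE_CLOUD_PROJECT")
    )
```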

How to reproduce

Part1

from airflow.providers.google.cloud.operators.bigquery import BigQueryGetDataOperator

bq = BigQueryGetDataOperator(
    task_id="my_test_query_task_id",
    gcp_conn_id="bigquery",
    table_id="mytable",
    dataset_id="mydataset",
    project_id="my_non_default_project",
)
f2 = bq.execute(None)

In my environment I have set:

AIRFLOW_CONN_BIGQUERY=gcpbigquery://
GOOGLE_CLOUD_PROJECT=my_primary_project
GOOGLE_APPLICATION_CREDENTIALS=/usr/local/airflow/gcloud/application_default_credentials.json

The credentials JSON file does not contain a project.

Part2

Unsetting GOOGLE_CLOUD_PROJECT and rerunning results in

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.9/site-packages/airflow/providers/google/cloud/operators/bigquery.py", line 886, in execute
    schema: dict[str, list] = hook.get_schema(
  File "/usr/local/lib/python3.9/site-packages/airflow/providers/google/common/hooks/base_google.py", line 463, in inner_wrapper
    raise AirflowException(
airflow.exceptions.AirflowException: The project id must be passed either as keyword project_id parameter or as project_id extra in Google Cloud connection definition. Both are not set!
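The exception comes from the guard in base_google.py that falls back to the hook's default project when the caller does not pass one. A simplified sketch of that pattern (the real decorator in Airflow is more involved; names and shape here are an approximation):

```python
from functools import wraps

class AirflowException(Exception):
    pass

def fallback_to_default_project_id(func):
    # Simplified sketch: if the caller did not pass project_id, fall back
    # to the hook's default project; if neither is set, raise the error
    # shown in the traceback above.
    @wraps(func)
    def inner_wrapper(self, *args, project_id=None, **kwargs):
        resolved = project_id or self.project_id
        if resolved is None:
            raise AirflowException(
                "The project id must be passed either as keyword project_id "
                "parameter or as project_id extra in Google Cloud connection "
                "definition. Both are not set!"
            )
        return func(self, *args, project_id=resolved, **kwargs)
    return inner_wrapper
```

This explains Part2: with GOOGLE_CLOUD_PROJECT unset, the hook has no default project, and because the operator never forwards its own project_id, the guard fires.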

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct

@ying-w ying-w added the area:providers, kind:bug, and needs-triage labels Apr 14, 2023
@eladkal
Contributor

eladkal commented Apr 14, 2023

@sudohainguyen can you take a look?

@eladkal eladkal added the provider:google and good first issue labels and removed the needs-triage label Apr 14, 2023
@nitinpandey-154

nitinpandey-154 commented Apr 14, 2023

In Part1, do you mean project_id when invoking BigQueryGetDataOperator?

@sudohainguyen
Contributor

I guess you meant the billing project for the query execution; if so, you should use the location keyword param. project_id is where your table is located.

@ying-w
Contributor Author

ying-w commented Apr 14, 2023

In Part1, Do you mean project_id when invoking BigQueryGetDataOperator?

Yes, sorry; I just edited the code sample. I mean the project_id= parameter.

the billing project for the query execution, you should use location keyword param

Setting location='us' doesn't do anything

@sudohainguyen
Contributor

Try setting location=my_primary_project 🤔

@ying-w
Contributor Author

ying-w commented Apr 14, 2023

I'm trying to query table in my_non_default_project when I have GOOGLE_CLOUD_PROJECT=my_primary_project set in env.

With GOOGLE_CLOUD_PROJECT=my_primary_project set, and with location=my_non_default_project and project_id=my_non_default_project: same error, where the GET URL uses my_primary_project rather than my_non_default_project.

Without GOOGLE_CLOUD_PROJECT=my_primary_project set, and with location=my_non_default_project and project_id=my_non_default_project: same error, where project_id is not found.

Maybe because it's using hook.project_id rather than self.project_id in _submit_job()?

@ying-w
Contributor Author

ying-w commented Apr 14, 2023

@sudohainguyen I think I found it: hook.get_schema() needs project_id=self.project_id

schema: dict[str, list] = hook.get_schema(
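The fix (merged in #30651) amounts to forwarding the operator's own project_id to the hook call instead of letting the hook fall back to its default project. A hedged sketch using a stand-in hook class (FakeHook is hypothetical, only the call shape mirrors the issue):

```python
class FakeHook:
    """Stand-in for BigQueryHook to illustrate the call shape (hypothetical)."""

    def __init__(self, default_project):
        self.project_id = default_project

    def get_schema(self, dataset_id, table_id, project_id=None):
        # The hook uses its own default project unless one is passed in.
        return {"project": project_id or self.project_id, "fields": []}

hook = FakeHook("my_primary_project")

# Before the fix: the operator did not forward project_id, so the hook's
# default (from GOOGLE_CLOUD_PROJECT) silently won.
before = hook.get_schema(dataset_id="mydataset", table_id="mytable")

# After the fix: the operator forwards self.project_id explicitly.
after = hook.get_schema(
    dataset_id="mydataset",
    table_id="mytable",
    project_id="my_non_default_project",
)
```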
