GCSToBigQueryOperator - allow upload to existing table without specifying schema_fields/schema_object #12329
Comments
Thanks for opening your first issue here! Be sure to follow the issue template!
While this sounds like a good idea, I would recommend using BashOperator with the `bq load` command in the meantime.
That's a good suggestion; with BashOperator and `bq load` this can be worked around for now.
That would be good, but it will probably require refactoring this operator to use the new methods rather than the deprecated ones; see #10288
Hi @eladkal , I took a look at this issue and it seems like with this commit from @VladaZakharova a couple days ago, this is mostly working as expected. There is a small nit that is causing it not to work perfectly, due to a check for self.autodetect being falsy as opposed to it being explicitly set to None. In fact, the Job docs from Google allude to it working this way, but you kinda need to read between the lines a bit.
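For illustration, here is a minimal hypothetical snippet (not the actual Airflow source) showing why a falsy check behaves differently from an explicit None check, which is the nit being described:

```python
# Illustrative only, not the Airflow source: a falsy check cannot tell
# "autodetect explicitly disabled" (False) apart from "autodetect not
# specified" (None), while an identity check against None can.
for autodetect in (True, False, None):
    falsy_fires = not autodetect      # fires for both False and None
    none_fires = autodetect is None   # fires only for None
    print(f"autodetect={autodetect!r}: falsy check fires={falsy_fires}, "
          f"is-None check fires={none_fires}")
```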
Once I patch this with PR #28564, it works fine. To verify, I tried this on my local setup with a simple DAG:

```python
from airflow import DAG
from etsy.operators.gcs_to_bigquery import GCSToBigQueryOperator

DEFAULT_TASK_ARGS = {
    "owner": "gcp-data-platform",
    "retries": 1,
    "retry_delay": 10,
    "start_date": "2022-08-01",
}

with DAG(
    max_active_runs=1,
    concurrency=2,
    catchup=False,
    schedule_interval="@daily",
    dag_id="test_os_patch_gcs_to_bigquery",
    default_args=DEFAULT_TASK_ARGS,
) as dag:
    test_gcs_to_bigquery = GCSToBigQueryOperator(
        task_id="test_gcs_to_bigquery",
        create_disposition="CREATE_IF_NEEDED",
        # Need to explicitly set autodetect to None
        autodetect=None,
        write_disposition="WRITE_TRUNCATE",
        destination_project_dataset_table="my-project.vchiapaikeo.test1",
        bucket="my-bucket",
        source_format="CSV",
        source_objects=["vchiapaikeo/file.csv"],
    )
```

I then created a simple table in BigQuery, ran the DAG, and checked the task logs (screenshots of the table, DAG run, and logs omitted).
^ omitted some redundant log lines. PR: #28564
Description
We would like to be able to load data into existing BigQuery tables without having to specify schema_fields/schema_object in GCSToBigQueryOperator, since the table already exists.

Use case / motivation
BigQuery load job usage details and problem explanation
We create BigQuery tables/datasets through a CI process (Terraform-managed), and with the help of Airflow we update those tables with data.
To update tables with data we can use:
- the Airflow 2.0 operator GCSToBigQueryOperator
- the Airflow 1.* operator (deprecated) GoogleCloudStorageToBigQueryOperator

However, those operators require one of three things to be specified: `schema_fields`, `schema_object`, or `autodetect=True` (the validation sketch below illustrates this). Otherwise they will raise an exception.
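A minimal sketch of the schema-selection requirement described above (hypothetical code, not the actual operator source):

```python
# Hypothetical sketch of the requirement described above; not the actual
# Airflow source. The operator needs one of three inputs to proceed.
def resolve_schema(schema_fields=None, schema_object=None, autodetect=False):
    if schema_fields:
        return f"inline schema: {schema_fields}"
    if schema_object:
        return f"schema file in GCS: {schema_object}"
    if autodetect:
        return "no schema passed; BigQuery infers one"
    raise ValueError("Table schema was not found and autodetect is off.")

print(resolve_schema(autodetect=True))
print(resolve_schema(schema_object="schemas/table.json"))
# resolve_schema() with no arguments would raise, which is the behaviour
# this issue asks to relax for pre-existing tables.
```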
Note: the exception does not actually say that `autodetect` must be `True`, but according to the code it must be set to `True`, or a schema must be supplied instead.

But we have already created the table, and we can update it using the `bq load` command (which the Airflow operators mentioned above use internally).

When using `bq load` you also have the option to specify a schema. The schema can be a local JSON file, or it can be typed inline as part of the command. You can also use the `--autodetect` flag instead of supplying a schema definition: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv#bq
When you specify `--autodetect` as True, BigQuery will try to assign generic names to your columns, e.g. `string_field_0`, `int_field_1`, and if you are trying to load into an existing table, `bq load` will fail with the error `Cannot add fields (field: string_field_0)`. Airflow operators like GCSToBigQueryOperator fail the same way.
However, there is also an option NOT to specify `--autodetect`, or to specify `--autodetect=false`; in this case `bq load` will load from Cloud Storage into the existing BigQuery table without problems.

Proposal/TL;DR:
Add an option to not specify `--autodetect`, or to specify `--autodetect=False`, when `write_disposition='WRITE_APPEND'` is used in GCSToBigQueryOperator. This will allow the operator to update an existing BigQuery table without having to specify a schema within the operator itself (it will just append data to the existing table). A sketch of what such usage could look like is shown below.
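As a sketch of the proposed usage (hypothetical until the option exists; the bucket, object, and table names are placeholders):

```python
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
    GCSToBigQueryOperator,
)

# Hypothetical usage under this proposal: no schema_fields/schema_object,
# autodetect explicitly disabled, appending to a pre-existing table whose
# schema BigQuery already knows.
load_to_existing = GCSToBigQueryOperator(
    task_id="load_to_existing_table",
    bucket="my-bucket",                # placeholder bucket
    source_objects=["path/file.csv"],  # placeholder object
    destination_project_dataset_table="my-project.my_dataset.my_table",
    source_format="CSV",
    write_disposition="WRITE_APPEND",
    autodetect=False,  # proposal: skip inference, use the table's own schema
)
```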