Add ORC support to GCSToBigQueryOperator #12556

Closed
juneoh opened this issue Nov 23, 2020 · 10 comments
Labels
kind:feature Feature Requests provider:google Google (including GCP) related issues won't fix

Comments

@juneoh

juneoh commented Nov 23, 2020

Description

The BigQuery API for loading data lists ORC as one of its supported file formats, but GCSToBigQueryOperator currently does not allow it. Specifying ORC as the source format raises a ValueError in BigQueryHook.
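To illustrate the failure mode described above, here is a minimal sketch of an allow-list check on the source format. It is not Airflow's actual code: the function name `validate_source_format`, the contents of `ALLOWED_SOURCE_FORMATS`, and the error message are assumptions made for the example.

```python
from typing import List

# Hypothetical allow-list for illustration; note that ORC is absent.
ALLOWED_SOURCE_FORMATS: List[str] = [
    "CSV",
    "NEWLINE_DELIMITED_JSON",
    "AVRO",
    "PARQUET",
]


def validate_source_format(source_format: str, allowed: List[str] = ALLOWED_SOURCE_FORMATS) -> str:
    """Return the upper-cased format, or raise ValueError if it is not in the allow-list."""
    normalized = source_format.upper()
    if normalized not in allowed:
        raise ValueError(
            f"{normalized} is not a valid source format. "
            f"Please use one of the following types: {allowed}."
        )
    return normalized
```

With a check of this shape, supporting ORC amounts to adding `"ORC"` to the allow-list, since the BigQuery API itself already accepts the format.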

Use case / motivation

Include the ORC file format in BigQueryHook so that GCSToBigQueryOperator allows every file format supported by the BigQuery API.

@juneoh juneoh added the kind:feature Feature Requests label Nov 23, 2020
@boring-cyborg

boring-cyborg bot commented Nov 23, 2020

Thanks for opening your first issue here! Be sure to follow the issue template!

@turbaszek
Member

turbaszek commented Nov 23, 2020

As mentioned in the PR, the run_load method is deprecated and users should use the insert_job method, which supports everything the BigQuery API allows. That said, the change we should make is to update GCSToBigQueryOperator to use the new method instead of the old one; see #10288 (there seems to be a PR for it, but it needs some more attention).
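As a rough sketch of the insert_job route, the helper below builds a load-job configuration with `sourceFormat` set to ORC, using the field names of the BigQuery REST API job configuration. The helper name `build_orc_load_configuration` and the bucket, project, dataset, and table values are hypothetical, and the hook call is shown only as a comment because it needs an Airflow installation and GCP credentials.

```python
from typing import Any, Dict, List


def build_orc_load_configuration(
    source_uris: List[str],
    project_id: str,
    dataset_id: str,
    table_id: str,
    write_disposition: str = "WRITE_EMPTY",
) -> Dict[str, Any]:
    """Build a BigQuery load-job configuration dict for ORC files in GCS."""
    return {
        "load": {
            "sourceUris": list(source_uris),
            "destinationTable": {
                "projectId": project_id,
                "datasetId": dataset_id,
                "tableId": table_id,
            },
            "sourceFormat": "ORC",
            "writeDisposition": write_disposition,
        }
    }


# Hypothetical values, for illustration only.
configuration = build_orc_load_configuration(
    source_uris=["gs://my-bucket/data/*.orc"],
    project_id="my-project",
    dataset_id="my_dataset",
    table_id="my_table",
)

# Assumed usage with the Google provider installed (not executed here):
# from airflow.providers.google.cloud.hooks.bigquery import BigQueryHook
# hook = BigQueryHook()
# hook.insert_job(configuration=configuration, project_id="my-project")
```

Because the configuration is passed through to the API as-is, this path does not depend on any format allow-list in the deprecated method.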

@turbaszek turbaszek added the provider:google Google (including GCP) related issues label Nov 23, 2020
@mik-laj
Member

mik-laj commented Dec 22, 2020

@juneoh Are you willing to submit a PR?

@juneoh
Author

juneoh commented Dec 22, 2020

Sure, I'll take a look during the holidays.

@juneoh
Author

juneoh commented Jan 7, 2021

I've learned that holiday productivity is an illusion. I'll check in again next week as I'm ramping back up.

@juneoh
Author

juneoh commented Jan 15, 2021

Started working on this. After studying the history, it looks like we need to fix both paths in execute(), for BigQuery tables and for external tables, so that they stop using the deprecated methods. This will likely fix #10288 and #12329 along the way.

@juneoh
Author

juneoh commented Jan 27, 2021

Still working on this. The basic implementation is done, but it's mostly a copy-paste from the BigQuery hook, so I'm trying to see what additional refactoring and testing is possible by leveraging the BigQuery Python SDK.

@eladkal
Contributor

eladkal commented Sep 20, 2021

@juneoh are you still working on it?

@eladkal
Contributor

eladkal commented Jul 12, 2022

The fix needed, while small, is in a deprecated function of the hook.
The function has been deprecated for almost 2 years, so I believe we are closer to removing it than to fixing a bug in it.
Also, no one has tried to address this issue for a very long time, so I guess most users have already moved away from the deprecated function.

I'm closing this issue as won't fix

@eladkal eladkal closed this as not planned Jul 12, 2022
@ericxiao251

ericxiao251 commented Jan 23, 2024

Hi @eladkal, I would love to take this issue up - I just ran into it at work, where we store all our data in ORC. I briefly read the comments and understand that the initial suggestion was to put the fix in a function/hook that is now deprecated.

What would be the approach to adding this feature today? Would you be able to point me to the file/function in Airflow where I should make this change?

Seems like we will need to provide some of the same logical type mapping that is supported with AVRO as well: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-orc#orc_conversions.


5 participants