Allow PARQUET format for uploading data. #609

apalacio9502 · 2024-05-18T03:09:21Z

This pull request lets the user choose the format (JSON or PARQUET) used to transmit data to BigQuery. For large amounts of data, loading in JSON format is very time-consuming because of the size of the payload that has to be transmitted. To address this, BigQuery accepts other file formats, including Parquet. The most significant change enabling PARQUET transmission is that the uploadType is no longer multipart; it is now resumable.

https://cloud.google.com/bigquery/docs/reference/api-uploads
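For illustration, a call that opts into the new format might look like the sketch below. It assumes `bq_perform_upload()` receives the format through the `source_format` argument named in this PR's commits. The wrapper name `upload_as_parquet` and its project, dataset, and table arguments are hypothetical, and actually running the upload needs BigQuery credentials.

```r
# Hypothetical wrapper (not part of this PR): upload a data frame to a
# BigQuery table, asking bigrquery to transmit it as Parquet instead of JSON.
upload_as_parquet <- function(project, dataset, table, values) {
  tbl <- bigrquery::bq_table(project, dataset, table)

  # source_format = "PARQUET" is the option this PR introduces.
  job <- bigrquery::bq_perform_upload(tbl, values, source_format = "PARQUET")

  # Block until the load job has finished.
  bigrquery::bq_job_wait(job)
  invisible(tbl)
}
```

A call would then be something like `upload_as_parquet("my-project", "my_dataset", "my_table", mtcars)`, where all three identifiers are placeholders.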

Regards,

hadley

Thanks for working on this!

.github/workflows/R-CMD-check.yaml

R/bq-perform.R

R/bq-request.R

R/bq-perform.R

apalacio9502 · 2024-05-27T18:10:55Z

Hi @hadley,

Thank you for your review. I have taken into account all of your comments, and I hope I haven't missed any.

Regards,

hadley

Two more small points. I really appreciate you working on this and your responsiveness to feedback 😄

hadley · 2024-05-30T22:10:37Z

NEWS.md

@@ -1,5 +1,14 @@
 # bigrquery (development version)

+## Significant improvements


Can you leave these headings off? We add them as part of the final release process.

hadley · 2024-05-30T22:11:11Z

DESCRIPTION

@@ -29,7 +29,8 @@ Imports:
    prettyunits,
    rlang (>= 1.1.0),
    tibble
-Suggests: 
+Suggests:
+    arrow,


Would you consider trying nanoparquet instead? It's very new but has no dependencies, so we could use it from imports.

apalacio9502 · 2024-05-31

I agree that nanoparquet is a good alternative because it has no dependencies. The only downside I see is that you would have to write to disk to read the raw data, as it lacks an output stream buffer implementation.

If you think the advantages outweigh the disadvantages, I could start testing and adapting the code.

hadley · 2024-05-31

Ah I didn't think about that, but I suspect it's still worthwhile given the lighter dependencies. Do you mind filing a nanoparquet issue to add a stream buffer output?

apalacio9502 · 2024-05-31

I will begin implementing nanoparquet. If BufferedOutputStream is added in the future, an update will be necessary.

r-lib/nanoparquet#31
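The trade-off discussed in this thread (write the Parquet file to disk, then read it back as raw bytes) can be sketched as follows. This is a minimal illustration and not the code merged in this PR: the helper name `df_to_parquet_raw` is hypothetical, and it uses base `on.exit()` for cleanup to stay dependency-free, whereas the PR itself uses `defer()`.

```r
# Hypothetical helper: serialise a data frame to Parquet bytes by
# round-tripping through a temporary file, since no in-memory output
# stream is assumed to be available.
df_to_parquet_raw <- function(df) {
  path <- tempfile(fileext = ".parquet")
  on.exit(unlink(path), add = TRUE)  # remove the temp file on exit

  nanoparquet::write_parquet(df, path)

  # Read the whole file back as a raw vector, ready to be sent as a request body.
  readBin(path, what = "raw", n = file.size(path))
}

payload <- df_to_parquet_raw(data.frame(x = 1:3, y = c("a", "b", "c")))
```

The resulting raw vector starts and ends with the Parquet magic bytes `PAR1`, which gives a quick sanity check before upload.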

apalacio9502 · 2024-06-03T18:45:21Z

Hi @hadley,

The switch from arrow to nanoparquet is complete. After several data-loading tests, I believe it works very well.

I look forward to your comments.

Regards,

hadley · 2024-06-04T16:28:41Z

Thanks so much for working on this!

Allow PARQUET format for uploading data.

eaab36a

apalacio9502 mentioned this pull request May 18, 2024

Improve bq_perform_upload to upload as PARQUET file #608

Closed

apalacio9502 and others added 8 commits May 20, 2024 18:16

Fixed typo in bq upload

fa79291

Add arrow to suggested packages

41d2151

Delete verbose post metadata in bq_upload

bbde49e

Drop R 3.6 check since we're losing it soon anyway and the arrow package depends on R > 4.0.0

95d8c3d

Add importFrom httr PUT in bq_upload

509feb6

Update news

a45a72c

Remove the multipart functions, as they are no longer required.

b16fe81

Improve the message indicating that Arrow is required for source_format = 'PARQUET'

adab0bf

hadley reviewed May 24, 2024

apalacio9502 and others added 12 commits May 24, 2024 17:28

R 3.6 is no longer supported

fc8eea0

bq_perform_upload documentation specifies that the arrow package is required.

76fd963

Update news

1e90f8f

Use arg_match instead of check_string in source_format

3196088

Use defer() insted on.exit()

41fae01

Co-authored-by: Hadley Wickham <[email protected]>

Use check_installed instead of requireNamespace

d7255b1

Style code

453e946

Co-authored-by: Hadley Wickham <[email protected]>

Implement BufferedOutputStream() to avoid writing to disk.

485a898

The bq_upload function is improved

774b62e

Delete a typo

14a7e29

arrow as a string in check_installed

af040e1

Update importFrom in bq_upload

3a49fe6

hadley reviewed May 30, 2024

apalacio9502 added 2 commits June 1, 2024 15:49

Remove headings from the development section.

f293253

Arrow package replaced by nanoparquet for Parquet files.

a68d651

hadley reviewed Jun 4, 2024

Apply suggestions from code review

56f4207

hadley merged commit 3642c14 into r-dbi:main Jun 4, 2024
12 checks passed
