🎉 BigQuery destination: use serialized buffer for gcs staging #11776

tuliren · 2022-04-07T00:59:13Z

What

This PR resolves #11203 (Apply buffering changes to BigQuery Destination when using staging).
The BigQuery GCS staging destination is migrated to use a serialized file buffer to improve its scalability. This PR builds on top of Chris' marvelous buffered stream consumer change for Snowflake.
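
As a rough illustration of what a serialized file buffer does, here is a minimal sketch. It is not the Airbyte implementation: the class name FileBackedBuffer, the byte threshold, and the flush callback are all assumptions made for this example. Records are appended to a temporary file rather than held in memory, and the file is handed to a callback (for instance, an uploader to GCS) once it grows past the threshold.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch: buffers serialized records in a temp file and hands
// the file to a callback once a byte threshold is reached.
final class FileBackedBuffer implements AutoCloseable {
  private final long maxBytes;
  private final Consumer<Path> onFlush; // assumed hook, e.g. "upload this file to staging"
  private Path file;
  private BufferedWriter writer;
  private long bytesWritten;

  FileBackedBuffer(long maxBytes, Consumer<Path> onFlush) {
    this.maxBytes = maxBytes;
    this.onFlush = onFlush;
    open();
  }

  private void open() {
    try {
      file = Files.createTempFile("staging-buffer", ".jsonl");
      writer = Files.newBufferedWriter(file, StandardCharsets.UTF_8);
      bytesWritten = 0;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Appends one record; the byte count is approximate (one byte per line break).
  void accept(String serializedRecord) {
    try {
      writer.write(serializedRecord);
      writer.newLine();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    bytesWritten += serializedRecord.getBytes(StandardCharsets.UTF_8).length + 1;
    if (bytesWritten >= maxBytes) {
      flush();
      open();
    }
  }

  // Closes the current file, passes it to the callback if non-empty, then deletes it.
  private void flush() {
    try {
      writer.close();
      if (bytesWritten > 0) {
        onFlush.accept(file);
      }
      Files.deleteIfExists(file);
      bytesWritten = 0;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  @Override
  public void close() {
    flush();
  }

  public static void main(String[] args) {
    List<Path> uploaded = new ArrayList<>();
    try (FileBackedBuffer buffer = new FileBackedBuffer(16, uploaded::add)) {
      buffer.accept("{\"id\":1}");
      buffer.accept("{\"id\":2}");
      buffer.accept("{\"id\":3}");
    }
    System.out.println("files handed to the uploader: " + uploaded.size());
  }
}
```

Memory use in this sketch is bounded by the writer's internal buffer rather than by the number of records, which is the scalability property the description refers to.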

How

The BigQuery direct uploading destination is unchanged.
The GCS staging destination uses the BufferedStreamConsumer and reuses some of the logic in the BigQuery uploader, BigQueryUtils, and BigQueryRecordFormatter.
A new BigQueryStagingOperations interface is introduced to handle the staging operations that involve the BigQuery client. It is similar to StagingOperations, but BigQuery requires enough unique logic that a separate interface is necessary. The implementation wraps the behaviors of the existing BigQuery uploaders.
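
The shape of such a staging-operations interface can be sketched as follows. This is an illustration only: the interface name StagingOpsSketch, its method names, and the in-memory implementation are assumptions for this example and are not taken from the PR's code. The point is the lifecycle: create a stage, upload buffered records to it, copy the staged file into a table, then clean up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical interface; the method names are illustrative, not the real API.
interface StagingOpsSketch {
  void createStageIfNotExists(String stage);
  String uploadRecordsToStage(String stage, List<String> records);
  void copyIntoTableFromStage(String stage, String stagedFile, String table);
  void cleanUpStage(String stage);
}

// In-memory stand-in for blob storage and warehouse tables, for demonstration only.
final class InMemoryStagingOps implements StagingOpsSketch {
  final Map<String, Map<String, List<String>>> stages = new HashMap<>();
  final Map<String, List<String>> tables = new HashMap<>();

  @Override
  public void createStageIfNotExists(String stage) {
    stages.computeIfAbsent(stage, s -> new HashMap<>());
  }

  @Override
  public String uploadRecordsToStage(String stage, List<String> records) {
    Map<String, List<String>> files = requireStage(stage);
    String fileName = "part-" + files.size();
    files.put(fileName, new ArrayList<>(records));
    return fileName;
  }

  @Override
  public void copyIntoTableFromStage(String stage, String stagedFile, String table) {
    List<String> records = requireStage(stage).get(stagedFile);
    if (records == null) {
      throw new IllegalStateException("no such staged file: " + stagedFile);
    }
    tables.computeIfAbsent(table, t -> new ArrayList<>()).addAll(records);
  }

  @Override
  public void cleanUpStage(String stage) {
    stages.remove(stage);
  }

  private Map<String, List<String>> requireStage(String stage) {
    Map<String, List<String>> files = stages.get(stage);
    if (files == null) {
      throw new IllegalStateException("stage not created: " + stage);
    }
    return files;
  }

  public static void main(String[] args) {
    InMemoryStagingOps ops = new InMemoryStagingOps();
    ops.createStageIfNotExists("users_stage");
    String file = ops.uploadRecordsToStage("users_stage", Arrays.asList("r1", "r2"));
    ops.copyIntoTableFromStage("users_stage", file, "users");
    ops.cleanUpStage("users_stage");
    System.out.println("rows in users: " + ops.tables.get("users").size());
  }
}
```

Keeping these steps behind one interface lets a consumer drive the same sequence whether the backing client is JDBC-based or, as here, something with its own client.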

🚨 User Impact 🚨

None expected.

TODO

Fix integration test for BigQuery denormalized destination.
Remove deprecated uploader code.

Pre-merge Checklist

Updating a connector

Community member or Airbyter

Grant edit access to maintainers (instructions)
Secrets in the connector's spec are annotated with airbyte_secret
Unit & integration tests added and passing. Community members, please provide proof of success locally, e.g. a screenshot or copy-pasted unit, integration, and acceptance test output. To run acceptance tests for a Python connector, follow instructions in the README. For Java connectors run ./gradlew :airbyte-integrations:connectors:<name>:integrationTest.
Code reviews completed
Documentation updated
- Connector's README.md
- Connector's bootstrap.md. See description and examples
- Changelog updated in docs/integrations/<source or destination>/<name>.md including changelog. See changelog example
PR name follows PR naming conventions

Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

Create a non-forked branch based on this PR and test the below items on it
Build is successful
If new credentials are required for use in CI, add them to GSM. Instructions.
/test connector=connectors/<name> command is passing
New Connector version released on Dockerhub by running the /publish command described here
After the new connector version is published, connector version bumped in the seed directory as described here
Seed specs have been re-generated by building the platform and committing the changes to the seed spec files, as described here

tuliren · 2022-04-07T02:27:14Z

/test connector=connectors/destination-bigquery

🕑 connectors/destination-bigquery https://github.com/airbytehq/airbyte/actions/runs/2106256610
✅ connectors/destination-bigquery https://github.com/airbytehq/airbyte/actions/runs/2106256610
Python tests coverage:

Name                                                                                                                            Stmts   Miss  Cover
---------------------------------------------------------------------------------------------------------------------------------------------------
normalization/transform_config/__init__.py                                                                                          2      0   100%
normalization/transform_catalog/reserved_keywords.py                                                                               13      0   100%
normalization/transform_catalog/__init__.py                                                                                         2      0   100%
normalization/destination_type.py                                                                                                  13      0   100%
normalization/__init__.py                                                                                                           4      0   100%
/actions-runner/_work/airbyte/airbyte/airbyte-integrations/bases/airbyte-protocol/airbyte_protocol/models/airbyte_protocol.py     124      0   100%
/actions-runner/_work/airbyte/airbyte/airbyte-integrations/bases/airbyte-protocol/airbyte_protocol/models/__init__.py               1      0   100%
/actions-runner/_work/airbyte/airbyte/airbyte-integrations/bases/airbyte-protocol/airbyte_protocol/__init__.py                      2      0   100%
normalization/transform_catalog/destination_name_transformer.py                                                                   155      8    95%
normalization/transform_config/transform.py                                                                                       168     31    82%
normalization/transform_catalog/table_name_registry.py                                                                            174     34    80%
normalization/transform_catalog/utils.py                                                                                           33      7    79%
normalization/transform_catalog/catalog_processor.py                                                                              143     77    46%
normalization/transform_catalog/transform.py                                                                                       45     26    42%
normalization/transform_catalog/stream_processor.py                                                                               524    337    36%
---------------------------------------------------------------------------------------------------------------------------------------------------
TOTAL                                                                                                                            1403    520    63%

tuliren · 2022-04-07T09:55:08Z

/test connector=connectors/destination-bigquery

🕑 connectors/destination-bigquery https://github.com/airbytehq/airbyte/actions/runs/2108110638
✅ connectors/destination-bigquery https://github.com/airbytehq/airbyte/actions/runs/2108110638

tuliren · 2022-04-07T09:55:15Z

/test connector=connectors/destination-bigquery-denormalized

🕑 connectors/destination-bigquery-denormalized https://github.com/airbytehq/airbyte/actions/runs/2108111466
✅ connectors/destination-bigquery-denormalized https://github.com/airbytehq/airbyte/actions/runs/2108111466

ChristopheDuong · 2022-04-07T10:01:02Z

airbyte-integrations/connectors/destination-bigquery/build.gradle

@@ -12,9 +12,8 @@ application {
 dependencies {


I forgot to mention this on the previous S3/GCS PR, but can we make the integration tests of warehouse destinations (BigQuery, Snowflake, and others in the future) depend on the blob storage destinations that are used underneath for staging?

So if we run ./gradlew :airbyte-integrations:connectors:destination-gcs:integrationTest, it should also run ./gradlew :airbyte-integrations:connectors:destination-bigquery*:integrationTest, etc.?

tuliren: Good idea. I created an issue here: #11815

ChristopheDuong · 2022-04-07T10:04:38Z

...src/main/java/io/airbyte/integrations/destination/bigquery/BigQueryAvroSerializedBuffer.java

+/**
+ * This class differs from {@link AvroSerializedBuffer} in that 1) the Avro schema can be customized
+ * by the caller, and 2) the message is formatted by {@link BigQueryRecordFormatter}. In this way,
+ * this buffer satisfies the needs of both the standard and the denormalized BigQuery destinations.
+ */


Shouldn't we introduce a common concept to handle this, though?

If I understand correctly, this sounds like the CsvSheetGenerator?

tuliren: At this point, only BigQuery needs a separate implementation, because the BigQuery schema is a bit hacky. I doubt other destinations will ever need this, so I don't think something more generic is necessary. Maybe let's wait until there is another use case?
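
The two differences named in the Javadoc above (a caller-supplied schema and a pluggable formatter) can be sketched without Avro. Everything in this example is hypothetical: the class CustomizableBuffer, the two factory methods, and the use of plain maps in place of Avro records are assumptions for illustration. The "_airbyte_data" column name in the standard variant is likewise only illustrative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for a buffer whose schema and record formatter
// are both chosen by the caller instead of being fixed by the buffer.
final class CustomizableBuffer {
  private final List<String> schemaFields;
  private final Function<Map<String, Object>, Map<String, Object>> formatter;
  private final List<Map<String, Object>> rows = new ArrayList<>();

  CustomizableBuffer(List<String> schemaFields,
                     Function<Map<String, Object>, Map<String, Object>> formatter) {
    this.schemaFields = schemaFields;
    this.formatter = formatter;
  }

  // Illustrative "standard" variant: the whole record goes into one string column.
  static CustomizableBuffer standard() {
    return new CustomizableBuffer(
        Collections.singletonList("_airbyte_data"),
        record -> {
          Map<String, Object> out = new LinkedHashMap<>();
          out.put("_airbyte_data", record.toString());
          return out;
        });
  }

  // Illustrative "denormalized" variant: each declared field becomes its own column.
  static CustomizableBuffer denormalized(List<String> fields) {
    return new CustomizableBuffer(fields, record -> record);
  }

  void accept(Map<String, Object> record) {
    Map<String, Object> formatted = formatter.apply(record);
    // Keep only the fields declared in the schema, in schema order.
    Map<String, Object> row = new LinkedHashMap<>();
    for (String field : schemaFields) {
      row.put(field, formatted.get(field));
    }
    rows.add(row);
  }

  List<Map<String, Object>> rows() {
    return rows;
  }

  public static void main(String[] args) {
    Map<String, Object> record = new LinkedHashMap<>();
    record.put("id", 1);
    record.put("name", "a");

    CustomizableBuffer standard = CustomizableBuffer.standard();
    standard.accept(record);
    CustomizableBuffer denormalized = CustomizableBuffer.denormalized(Arrays.asList("id", "name"));
    denormalized.accept(record);

    System.out.println(standard.rows());
    System.out.println(denormalized.rows());
  }
}
```

One buffer class serves both destinations here because the variation lives entirely in the two constructor arguments, which is the design the Javadoc describes.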

ChristopheDuong · 2022-04-07T10:07:06Z

...bigquery/src/main/java/io/airbyte/integrations/destination/bigquery/BigQueryDestination.java

-    return getRecordConsumer(getUploaderMap(config, catalog), outputRecordCollector);
+    final UploadingMethod uploadingMethod = BigQueryUtils.getLoadingMethod(config);
+    if (uploadingMethod == UploadingMethod.STANDARD) {
+      return getStandardRecordConsumer(config, catalog, outputRecordCollector);


BTW, should a warning be printed in the logs to recommend the staging option over the direct upload method (especially when dealing with larger streams)?

(We could even suggest in the message which format should be used to stage the data, if that's exposed to users?)

I guess it's always staging in Avro...

tuliren: Will do. I will also update the documentation to mention that.
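
A minimal sketch of the branch in the diff above, with the warning discussed in this thread, might look like the following. It is an assumption-laden illustration: the enum UploadingMethodSketch, the flat "loading_method" config key and its "GCS Staging" value, the returned labels, and the use of java.util.logging (the PR itself imports SLF4J) are all stand-ins chosen to keep the example self-contained.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.logging.Logger;

// Hypothetical mirror of the two upload modes.
enum UploadingMethodSketch { STANDARD, GCS }

final class ConsumerSelector {
  private static final Logger LOGGER = Logger.getLogger(ConsumerSelector.class.getName());

  // Assumed config shape: a flat map with an optional "loading_method" entry.
  static UploadingMethodSketch getLoadingMethod(Map<String, String> config) {
    String method = config.get("loading_method");
    return "GCS Staging".equals(method) ? UploadingMethodSketch.GCS : UploadingMethodSketch.STANDARD;
  }

  // Returns a label for the consumer that would be built; warns on the direct path.
  static String selectConsumer(Map<String, String> config) {
    UploadingMethodSketch method = getLoadingMethod(config);
    if (method == UploadingMethodSketch.STANDARD) {
      LOGGER.warning("Standard (direct) upload selected. "
          + "GCS staging is recommended for better performance, especially with large streams.");
      return "standard-consumer";
    }
    return "staging-consumer";
  }

  public static void main(String[] args) {
    Map<String, String> config = new HashMap<>();
    System.out.println(selectConsumer(config)); // no loading method set: falls back to standard
    config.put("loading_method", "GCS Staging");
    System.out.println(selectConsumer(config));
  }
}
```

Emitting the warning at consumer-selection time means it appears once per sync, before any records are written.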

ChristopheDuong · 2022-04-07T10:22:13Z

...c/main/java/io/airbyte/integrations/destination/bigquery/BigQueryStagingConsumerFactory.java

+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class BigQueryStagingConsumerFactory {


Do you think it would be possible to extract a common staging consumer factory from Snowflake and BigQuery?

We probably don't want each destination (Redshift, etc.) to have its own staging consumer factory moving forward, no?

This could definitely be another PR, though.

tuliren: I did try to reuse the existing factory, or to modify the Snowflake factory to fit BigQuery. However, that complicates the factory a lot, because BigQuery is different and has its own client. I don't think we need another consumer factory for destinations like Redshift, because most destinations are JDBC-based and fit into the current StagingConsumerFactory.

ChristopheDuong · 2022-04-07T10:24:28Z

...bigquery/src/main/java/io/airbyte/integrations/destination/bigquery/BigQueryWriteConfig.java

+/**
+ * @param datasetId the dataset ID is equivalent to output schema
+ */
+public record BigQueryWriteConfig(


There might be an opportunity to DRY this up too?

tuliren: This class also has a lot of BigQuery-specific logic (e.g. TableId), so I'd rather not merge it with the other write configs.

tuliren · 2022-04-07T20:55:40Z

/publish connector=connectors/destination-bigquery

🕑 connectors/destination-bigquery https://github.com/airbytehq/airbyte/actions/runs/2111504304
🚀 Successfully published connectors/destination-bigquery
❌ Couldn't auto-bump version for connectors/destination-bigquery

tuliren · 2022-04-07T20:55:46Z

/publish connector=connectors/destination-bigquery-denormalized

🕑 connectors/destination-bigquery-denormalized https://github.com/airbytehq/airbyte/actions/runs/2111504603
🚀 Successfully published connectors/destination-bigquery-denormalized
🚀 Auto-bumped version for connectors/destination-bigquery-denormalized
✅ connectors/destination-bigquery-denormalized https://github.com/airbytehq/airbyte/actions/runs/2111504603

Rebase bigquery changes to master

4e901f2

github-actions bot added the area/connectors Connector related issues label Apr 7, 2022

tuliren added 2 commits April 6, 2022 18:22

Add comments

7a586bd

Uncomment test code

dff4889

tuliren requested review from ChristopheDuong and edgao April 7, 2022 02:27

tuliren added 2 commits April 6, 2022 20:11

Format code

d731fe6

Bump versions

69a2ad4

github-actions bot added the area/documentation Improvements or additions to documentation label Apr 7, 2022

tuliren added 4 commits April 7, 2022 01:39

Fix denormalized destination target table name

cc523a8

Fix avro schema for denormalized destination

d02b3ad

Remove unnecessary params from consumer factory

ff1485f

Add back previous version

f9aa7a8

ChristopheDuong reviewed Apr 7, 2022

ChristopheDuong approved these changes Apr 7, 2022

tuliren requested a review from subodh1810 April 7, 2022 16:37

Add warning about standard mode

149f109

auto-bump connector version

0578f1a

octavia-squidington-iii temporarily deployed to more-secrets April 7, 2022 21:34 Inactive

Bump version for bigquery in seed

cbb0c98

tuliren merged commit 8bd2d9b into master Apr 7, 2022

tuliren deleted the liren/rebase-bigquery-changes branch April 7, 2022 23:59

tuliren temporarily deployed to more-secrets April 8, 2022 00:00 Inactive
