🎉 BigQuery destinations with partitioned/clustered keys #7240
Conversation
/test connector=destination-bigquery

I still need to fix the integration tests and add extra ones to verify the partition migration of existing non-partitioned tables in the destination into partitioned ones.
@ChristopheDuong could you fill out the connector checklist? The one thing that stood out is we should report this change in the .md file for BQ connectors to describe the change.
...query/src/main/java/io/airbyte/integrations/destination/bigquery/BigQueryRecordConsumer.java
@@ -16,6 +16,7 @@ dependencies {
 implementation files(project(':airbyte-integrations:bases:base-java').airbyteDocker.outputs)
 implementation project(':airbyte-integrations:connectors:source-relational-db')
 implementation project(':airbyte-integrations:connectors:source-mongodb-v2')
+implementation 'org.mongodb:mongodb-driver-sync:4.3.0'
how come this was added here? was it just missing from before?
@@ -13,6 +13,11 @@ dependencies {
 implementation 'com.google.cloud:google-cloud-bigquery:1.122.2'
 implementation 'org.apache.commons:commons-lang3:3.11'

+// csv
+implementation 'com.amazonaws:aws-java-sdk-s3:1.11.978'
why were these added?
implementation 'org.apache.commons:commons-csv:1.4' is needed by writeRecordToCsv to use the CsvPrinter similarly to how it is used in destination-gcs. The rest seems unnecessary...
Is that correct @andresbravog?
That is correct! In my tests, it was not compiling without the other libs.
...query/src/main/java/io/airbyte/integrations/destination/bigquery/BigQueryRecordConsumer.java
// Copying data from partitioned tmp table into a non-partitioned table does not make it
// partitioned... we need to force re-create from 0...
bigquery.delete(destinationTableId);
copyTable(bigquery, tmpPartitionTableId, destinationTableId, WriteDisposition.WRITE_EMPTY);
shouldn't we be copying data from the destinationTableId into tmpPartitionedTableId instead of the other way around?
First, it does a create table ... as select ... from, which is a copy from destinationTableId into tmpPartitionedTableId (that converts it from unpartitioned into partitioned). BigQuery copy jobs don't transfer table partition modes when copying, thus the SQL approach instead.
Finally, we simply need to "rename" the tmp back to destinationTableId (with a last simple delete/copy for that).
I see, so we:
- Create an empty partitioned tmpPartitionedTable as select * from destinationTable
- delete destination table
- copy tmpPartitionedTable into destination table
- copy new data from the tmp table (this tmp contains new data from the sync) into the destinationTable
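The sequence above can be sketched as the statements the connector would issue. This is a hedged sketch: the dataset/table names and the `createPartitionedTmp` helper are made up for illustration, and only `_airbyte_emitted_at` reflects the actual partitioning column; the real connector drives this through the google-cloud-bigquery client rather than raw strings.

```java
// Sketch of the partition-migration steps, assuming an _airbyte_emitted_at
// TIMESTAMP column; dataset and table names are illustrative only.
// A plain copy job would keep the destination unpartitioned, which is why
// step 1 must be a CREATE TABLE ... PARTITION BY ... AS SELECT.
public class PartitionMigrationSketch {

  static String createPartitionedTmp(final String tmpTable, final String destTable) {
    return "CREATE TABLE dataset." + tmpTable + " "
        + "PARTITION BY DATE(_airbyte_emitted_at) "
        + "AS SELECT * FROM dataset." + destTable;
  }

  public static void main(final String[] args) {
    // 1. re-create the data as a partitioned tmp table via SQL
    System.out.println(createPartitionedTmp("_tmp_partitioned", "destination_table"));
    // 2. delete the unpartitioned destination table
    System.out.println("DROP TABLE dataset.destination_table");
    // 3. copy the tmp table back; a copy job into a not-yet-existing table
    //    takes on the source's partition spec, so the result stays partitioned
    System.out.println("-- copy job: dataset._tmp_partitioned -> dataset.destination_table");
  }
}
```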
protected void writeRecordToCsv(final GcsCsvWriter gcsCsvWriter, final AirbyteRecordMessage recordMessage) {
  // Bigquery represents TIMESTAMP to the microsecond precision, so we convert to microseconds then
  // use BQ helpers to string-format correctly.
  final long emittedAtMicroseconds = TimeUnit.MICROSECONDS.convert(recordMessage.getEmittedAt(), TimeUnit.MILLISECONDS);
In this other PR, it's converted into seconds instead of microseconds? These changes should be somehow equivalent, right?
WDYT @etsybaev @andresbravog?
https://github.com/airbytehq/airbyte/pull/5981/files#r734312334
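For reference, millis-to-micros is a lossless scale-up (×1000), whereas converting down to seconds truncates sub-second precision, so the two approaches only coincide for whole-second timestamps. A quick standalone sketch (the class and helper names here are illustrative, not the connector's):

```java
import java.util.concurrent.TimeUnit;

// Quick check of the emitted_at conversion discussed above: milliseconds to
// microseconds multiplies by 1000 and loses nothing, while going down to
// seconds drops the sub-second part.
public class EmittedAtConversion {

  static long toMicros(final long emittedAtMillis) {
    return TimeUnit.MICROSECONDS.convert(emittedAtMillis, TimeUnit.MILLISECONDS);
  }

  static long toSeconds(final long emittedAtMillis) {
    return TimeUnit.SECONDS.convert(emittedAtMillis, TimeUnit.MILLISECONDS);
  }

  public static void main(final String[] args) {
    final long emittedAt = 1_635_206_400_123L; // a millisecond timestamp
    System.out.println(toMicros(emittedAt));   // 1635206400123000 (lossless)
    System.out.println(toSeconds(emittedAt));  // 1635206400 (123 ms dropped)
  }
}
```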
/test connector=destination-bigquery

/test connector=destination-bigquery-denormalized
@@ -148,7 +155,8 @@ Therefore, Airbyte BigQuery destination will convert any invalid characters into

 | Version | Date | Pull Request | Subject |
 | :--- | :--- | :--- | :--- |
 | 0.4.0 | 2021-10-04 | [\#6733](https://github.com/airbytehq/airbyte/issues/6733) | Support dataset starting with numbers |
+| 0.4.2 | 2021-10-26 | [\#7240](https://github.com/airbytehq/airbyte/issues/7240) | Output partitioned/clustered tables |
Should this be a minor release, since it reverts the type back to timestamp?
-| 0.4.2 | 2021-10-26 | [\#7240](https://github.com/airbytehq/airbyte/issues/7240) | Output partitioned/clustered tables |
+| 0.5.0 | 2021-10-26 | [\#7240](https://github.com/airbytehq/airbyte/issues/7240) | Output partitioned/clustered tables |
there's no functional difference (the app allows you to upgrade to whatever version) but that seems more "tidy"
* **Google BigQuery client chunk size**: the Google BigQuery client's chunk \(buffer\) size \(MIN=1, MAX=15\) for each table. The default value of 15 MiB is used if not set explicitly. It's recommended to decrease this value when migrating large datasets, to lower heap memory consumption and avoid crashes. For more details refer to [https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.client.Client.html](https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.client.Client.html)
* **Transformation Priority**: configures the priority of queries run for transformations. Refer to [https://cloud.google.com/bigquery/docs/running-queries](https://cloud.google.com/bigquery/docs/running-queries). By default, Airbyte runs interactive query jobs on BigQuery, which means that each query is executed as soon as possible and counts towards daily concurrent quotas and limits. If set to use batch queries on your behalf, BigQuery starts the query as soon as idle resources are available in the BigQuery shared resource pool. This usually occurs within a few minutes. If BigQuery hasn't started the query within 24 hours, BigQuery changes the job priority to interactive. Batch queries don't count towards your concurrent rate limit, which can make it easier to start many queries at once.
When configuring this option, the only impact on Airbyte's sync is that it takes longer, right?
yes and it's probably cheaper with more concurrent load as a trade-off
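For context, both options discussed in this thread are plain fields in the destination's connection config. A sketch of the relevant fragment; the field names are assumptions based on the connector spec of this era, and the values are illustrative:

```json
{
  "project_id": "my-gcp-project",
  "dataset_id": "my_dataset",
  "big_query_client_buffer_size_mb": 10,
  "transformation_priority": "batch"
}
```

With "batch", syncs can take longer (queries wait for idle pool capacity) but don't count against the concurrent-query quota.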
@@ -296,4 +305,68 @@ private void assertTmpTablesNotPresent(final List<String> tableNames) throws Int
       .collect(Collectors.toList());
 }

+@Test
+void testWritePartitionOverUnpartitioned() throws Exception {
love me a good test case 💪🏼
/publish connector=connectors/destination-bigquery

/publish connector=connectors/destination-bigquery-denormalized
* [airbytehq#5959][airbytehq#2579] Add support of partitioned tables by _airbyte_emitted_at field (airbytehq#7141) Co-authored-by: Andrés Bravo <[email protected]>
What
Follow-up on community contribution from @andresbravog: #7141
Adding docs and releasing #7118 as well
Closes #2579
Closes #5959 by reverting the column type back to timestamp
How
Describe the solution
Recommended reading order
x.java
y.python
Pre-merge Checklist
Expand the relevant checklist and delete the others.

Updating a connector

Community member or Airbyter
- Secrets in the connector's spec are annotated with airbyte_secret
- Integration tests added and passing (./gradlew :airbyte-integrations:connectors:<name>:integrationTest)
- Connector's README.md and bootstrap.md updated. See description and examples
- Changelog updated in docs/integrations/<source or destination>/<name>.md including changelog. See changelog example

Airbyter
If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.
- /test connector=connectors/<name> command is passing
- New connector version released by running the /publish command described here