🎉 BigQuery destinations with partitioned/clustered keys #7240
Conversation
/test connector=destination-bigquery

I still need to fix the integration tests and add extra ones to verify the partition migration of existing non-partitioned tables in the destination into partitioned ones.
@ChristopheDuong could you fill out the connector checklist? The one thing that stood out is we should report this change in the .md file for BQ connectors to describe the change.
...query/src/main/java/io/airbyte/integrations/destination/bigquery/BigQueryRecordConsumer.java
@@ -16,6 +16,7 @@ dependencies {
 implementation files(project(':airbyte-integrations:bases:base-java').airbyteDocker.outputs)
 implementation project(':airbyte-integrations:connectors:source-relational-db')
 implementation project(':airbyte-integrations:connectors:source-mongodb-v2')
+implementation 'org.mongodb:mongodb-driver-sync:4.3.0'
how come this was added here? was it just missing from before?
@@ -13,6 +13,11 @@ dependencies {
 implementation 'com.google.cloud:google-cloud-bigquery:1.122.2'
 implementation 'org.apache.commons:commons-lang3:3.11'

+// csv
+implementation 'com.amazonaws:aws-java-sdk-s3:1.11.978'
why were these added?
implementation 'org.apache.commons:commons-csv:1.4' is needed by writeRecordToCsv to use the CsvPrinter similarly to how it is used in destination-gcs. The rest seems unnecessary...
Is that correct @andresbravog?
That is correct! In my tests, it was not compiling without the other libs.
...query/src/main/java/io/airbyte/integrations/destination/bigquery/BigQueryRecordConsumer.java
// Copying data from partitioned tmp table into a non-partitioned table does not make it
// partitioned... we need to force re-create from 0...
bigquery.delete(destinationTableId);
copyTable(bigquery, tmpPartitionTableId, destinationTableId, WriteDisposition.WRITE_EMPTY);
shouldn't we be copying data from the destinationTableId into tmpPartitionedTableId instead of the other way around?
First, it does a create table ... as select ... from, which is a copy from destinationTableId into tmpPartitionedTableId (that converts it from unpartitioned into partitioned). BigQuery copy jobs don't transfer table partition modes when copying, thus the SQL approach instead.
Finally, we simply need to "rename" the tmp back to destinationTableId (with a last simple delete/copy for that).
I see, so we:
- Create an empty partitioned tmpPartitionedTable as select * from destinationTable
- delete destination table
- copy tmpPartitionedTable into destination table
- copy new data from the tmp table (this tmp contains new data from the sync) into the destinationTable
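The sequence above can be sketched as the statements the connector would issue. This is a hedged sketch: the dataset/table names and the `createPartitionedTmp` helper are made up for illustration, and only `_airbyte_emitted_at` reflects the actual partitioning column; the real connector drives this through the google-cloud-bigquery client rather than raw strings.

```java
// Sketch of the partition-migration steps, assuming an _airbyte_emitted_at
// TIMESTAMP column; dataset and table names are illustrative only.
// A plain copy job would keep the destination unpartitioned, which is why
// step 1 must be a CREATE TABLE ... PARTITION BY ... AS SELECT.
public class PartitionMigrationSketch {

  static String createPartitionedTmp(final String tmpTable, final String destTable) {
    return "CREATE TABLE dataset." + tmpTable + " "
        + "PARTITION BY DATE(_airbyte_emitted_at) "
        + "AS SELECT * FROM dataset." + destTable;
  }

  public static void main(final String[] args) {
    // 1. re-create the data as a partitioned tmp table via SQL
    System.out.println(createPartitionedTmp("_tmp_partitioned", "destination_table"));
    // 2. delete the unpartitioned destination table
    System.out.println("DROP TABLE dataset.destination_table");
    // 3. copy the tmp table back; a copy job into a not-yet-existing table
    //    takes on the source's partition spec, so the result stays partitioned
    System.out.println("-- copy job: dataset._tmp_partitioned -> dataset.destination_table");
  }
}
```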
protected void writeRecordToCsv(final GcsCsvWriter gcsCsvWriter, final AirbyteRecordMessage recordMessage) {
  // Bigquery represents TIMESTAMP to the microsecond precision, so we convert to microseconds then
  // use BQ helpers to string-format correctly.
  final long emittedAtMicroseconds = TimeUnit.MICROSECONDS.convert(recordMessage.getEmittedAt(), TimeUnit.MILLISECONDS);
In this other PR, it's converted into seconds instead of microseconds? These changes should be somehow equivalent, right?
WDYT @etsybaev @andresbravog?
https://github.com/airbytehq/airbyte/pull/5981/files#r734312334
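For reference, millis-to-micros is a lossless scale-up (×1000), whereas converting down to seconds truncates sub-second precision, so the two approaches only coincide for whole-second timestamps. A quick standalone sketch (the class and helper names here are illustrative, not the connector's):

```java
import java.util.concurrent.TimeUnit;

// Quick check of the emitted_at conversion discussed above: milliseconds to
// microseconds multiplies by 1000 and loses nothing, while going down to
// seconds drops the sub-second part.
public class EmittedAtConversion {

  static long toMicros(final long emittedAtMillis) {
    return TimeUnit.MICROSECONDS.convert(emittedAtMillis, TimeUnit.MILLISECONDS);
  }

  static long toSeconds(final long emittedAtMillis) {
    return TimeUnit.SECONDS.convert(emittedAtMillis, TimeUnit.MILLISECONDS);
  }

  public static void main(final String[] args) {
    final long emittedAt = 1_635_206_400_123L; // a millisecond timestamp
    System.out.println(toMicros(emittedAt));   // 1635206400123000 (lossless)
    System.out.println(toSeconds(emittedAt));  // 1635206400 (123 ms dropped)
  }
}
```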
/test connector=destination-bigquery

/test connector=destination-bigquery-denormalized
@@ -148,7 +155,8 @@ Therefore, Airbyte BigQuery destination will convert any invalid characters into

 | Version | Date | Pull Request | Subject |
 | :--- | :--- | :--- | :--- |
 | 0.4.0 | 2021-10-04 | [\#6733](https://github.com/airbytehq/airbyte/issues/6733) | Support dataset starting with numbers |
+| 0.4.2 | 2021-10-26 | [\#7240](https://github.com/airbytehq/airbyte/issues/7240) | Output partitioned/clustered tables |
Should this be a minor release, since it reverts the type back to timestamp?
-| 0.4.2 | 2021-10-26 | [\#7240](https://github.com/airbytehq/airbyte/issues/7240) | Output partitioned/clustered tables |
+| 0.5.0 | 2021-10-26 | [\#7240](https://github.com/airbytehq/airbyte/issues/7240) | Output partitioned/clustered tables |
there's no functional difference (the app allows you to upgrade to whatever version) but that seems more "tidy"
* **Google BigQuery client chunk size**: the Google BigQuery client's chunk \(buffer\) size \(MIN=1, MAX=15\) for each table. The default value of 15 MiB is used if not set explicitly. It's recommended to decrease this value when migrating large datasets, to lower heap memory consumption and avoid crashes. For more details refer to [https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.client.Client.html](https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.client.Client.html)
* **Transformation Priority**: configures the priority of queries run for transformations. Refer to [https://cloud.google.com/bigquery/docs/running-queries](https://cloud.google.com/bigquery/docs/running-queries). By default, Airbyte runs interactive query jobs on BigQuery, which means that each query is executed as soon as possible and counts towards daily concurrent quotas and limits. If set to use batch queries on your behalf, BigQuery starts the query as soon as idle resources are available in the BigQuery shared resource pool. This usually occurs within a few minutes. If BigQuery hasn't started the query within 24 hours, BigQuery changes the job priority to interactive. Batch queries don't count towards your concurrent rate limit, which can make it easier to start many queries at once.
When configuring this option, the only impact on Airbyte's sync is that it takes longer, right?
yes and it's probably cheaper with more concurrent load as a trade-off
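For context, both options discussed in this thread are plain fields in the destination's connection config. A sketch of the relevant fragment; the field names are assumptions based on the connector spec of this era, and the values are illustrative:

```json
{
  "project_id": "my-gcp-project",
  "dataset_id": "my_dataset",
  "big_query_client_buffer_size_mb": 10,
  "transformation_priority": "batch"
}
```

With "batch", syncs can take longer (queries wait for idle pool capacity) but don't count against the concurrent-query quota.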
@@ -296,4 +305,68 @@ private void assertTmpTablesNotPresent(final List<String> tableNames) throws Int
       .collect(Collectors.toList());
 }

+@Test
+void testWritePartitionOverUnpartitioned() throws Exception {
love me a good test case 💪🏼
/publish connector=connectors/destination-bigquery

/publish connector=connectors/destination-bigquery-denormalized
* [airbytehq#5959][airbytehq#2579] Add support of partitioned tables by _airbyte_emitted_at field (airbytehq#7141) Co-authored-by: Andrés Bravo <[email protected]>
What
Follow-up on community contribution from @andresbravog: #7141
Adding docs and releasing #7118 as well
Closes #2579
Closes #5959 by reverting the column type back to timestamp
How
Describe the solution
Recommended reading order
x.java
y.python
Pre-merge Checklist
Expand the relevant checklist and delete the others.

Updating a connector

Community member or Airbyter
- Secrets in the connector's spec are annotated with airbyte_secret
- Integration tests added and passing (./gradlew :airbyte-integrations:connectors:<name>:integrationTest)
- Connector's README.md and bootstrap.md updated. See description and examples
- Changelog updated in docs/integrations/<source or destination>/<name>.md including changelog. See changelog example

Airbyter
If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.
- /test connector=connectors/<name> command is passing
- New connector version released by running the /publish command described here