🎉 BigQuery destination: use serialized buffer for gcs staging #11776
Conversation
/test connector=connectors/destination-bigquery
/test connector=connectors/destination-bigquery
/test connector=connectors/destination-bigquery-denormalized
@@ -12,9 +12,8 @@ application {
dependencies {
I forgot to mention this on the previous S3/GCS PR, but can we make the integration tests of warehouse destinations (BigQuery / Snowflake / etc. in the future) depend on the blob storage destinations that are used underneath for staging?
So if we run ./gradlew :airbyte-integrations:connectors:destination-gcs:integrationTest, it should also run ./gradlew :airbyte-integrations:connectors:destination-bigquery*:integrationTest, etc.?
Good idea. I created an issue here: #11815
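For context, one possible way to wire this up is sketched below in the Gradle Groovy DSL. This is a hypothetical addition to airbyte-integrations/connectors/destination-gcs/build.gradle; the actual task names and wiring in the Airbyte build may differ.

```groovy
// Whenever the GCS integration tests run, also run the BigQuery ones, since the
// BigQuery destination stages its data through GCS underneath.
tasks.named('integrationTest') {
    finalizedBy ':airbyte-integrations:connectors:destination-bigquery:integrationTest',
                ':airbyte-integrations:connectors:destination-bigquery-denormalized:integrationTest'
}
```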
/**
 * This class differs from {@link AvroSerializedBuffer} in that 1) the Avro schema can be customized
 * by the caller, and 2) the message is formatted by {@link BigQueryRecordFormatter}. In this way,
 * this buffer satisfies the needs of both the standard and the denormalized BigQuery destinations.
 */
Shouldn't we introduce a common concept to handle this though?
If I understand correctly, this sounds like the CsvSheetGenerator?
At this point, only BigQuery needs a separate implementation, because the BigQuery schema is a bit hacky. I doubt other destinations will ever need this, so I don't think something more generic is necessary. Maybe let's wait until there is another use case?
👌
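For readers unfamiliar with the class being discussed, the idea is roughly the following. This is a minimal, self-contained sketch against the plain Avro API; the class name, generic parameter, and formatter type are illustrative, not the actual Airbyte signatures.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.function.Function;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

// Sketch of a serialized buffer whose Avro schema is supplied by the caller and
// whose records are converted by an injected formatter (e.g. a BigQuery-specific
// one), rather than being derived from the Airbyte catalog.
public class CallerSchemaAvroBuffer<T> {

  private final Function<T, GenericRecord> formatter;
  private final ByteArrayOutputStream out = new ByteArrayOutputStream();
  private final DataFileWriter<GenericRecord> writer;

  public CallerSchemaAvroBuffer(final Schema schema, final Function<T, GenericRecord> formatter) throws IOException {
    this.formatter = formatter;
    this.writer = new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema)).create(schema, out);
  }

  public void accept(final T message) throws IOException {
    // The formatter decides how each message maps onto the caller's schema.
    writer.append(formatter.apply(message));
  }

  public byte[] flush() throws IOException {
    writer.close(); // closing the Avro writer flushes the file footer
    return out.toByteArray();
  }
}
```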
return getRecordConsumer(getUploaderMap(config, catalog), outputRecordCollector);
final UploadingMethod uploadingMethod = BigQueryUtils.getLoadingMethod(config);
if (uploadingMethod == UploadingMethod.STANDARD) {
  return getStandardRecordConsumer(config, catalog, outputRecordCollector);
BTW, should we print a warning in the logs to recommend that users use the staging options rather than the direct upload method (especially when dealing with larger streams)?
(We could even suggest in the message which format should be used to stage the data, if that's exposed to users?)
I guess it's always staging in Avro...
Will do. I will also update the document to mention that.
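Presumably something along these lines — a sketch only; the exact wording and placement inside BigQueryDestination are assumptions:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical placement inside BigQueryDestination:
private static final Logger LOGGER = LoggerFactory.getLogger(BigQueryDestination.class);

// ...before falling back to the direct upload path in getConsumer:
if (uploadingMethod == UploadingMethod.STANDARD) {
  LOGGER.warn("The \"standard\" upload mode is not performant and is not recommended for large streams; "
      + "please use the GCS staging mode instead (data is staged as Avro files).");
  return getStandardRecordConsumer(config, catalog, outputRecordCollector);
}
```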
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class BigQueryStagingConsumerFactory {
Do you think it would be possible to get a common staging consumer factory for Snowflake and BigQuery?
We probably don't want each destination (Redshift, etc.) to have its own staging consumer factory moving forward, no?
This could definitely be another PR though.
I did try to reuse the existing factory, and to modify the Snowflake factory to fit BigQuery. However, it complicates the factory a lot, because BigQuery is different and has its own client. I don't think we need another consumer factory for destinations like Redshift, because most destinations are JDBC-based and fit into the current StagingConsumerFactory.
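To make the comparison concrete, both factories assemble roughly the same lifecycle around the buffered consumer; only the client underneath differs. The names below are hypothetical, not the actual Airbyte interfaces:

```java
// The shape that both StagingConsumerFactory (JDBC) and BigQueryStagingConsumerFactory
// build around a BufferedStreamConsumer. JDBC destinations share one implementation
// (a JdbcDatabase plus SQL operations); BigQuery needs its own client, hence the
// parallel factory.
interface StagingLifecycle {

  // onStart: create the dataset/schema and tmp tables before the sync.
  void prepare() throws Exception;

  // flush: serialize a full buffer, upload it to the stage (GCS here), then load
  // it into the target table with the destination's bulk-load API.
  void flush(String streamName, byte[] serializedBuffer) throws Exception;

  // onClose: finalize the target tables and clean up staged files, even on failure.
  void close(boolean hasFailed) throws Exception;
}
```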
/**
 * @param datasetId the dataset ID is equivalent to output schema
 */
public record BigQueryWriteConfig(
there might be an opportunity to DRY this up too?
This class also has a lot of BigQuery-specific logic (e.g. TableId). So I'd rather not merge it with the other write configs.
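For illustration, the record in question looks roughly like the sketch below. The field list is an assumption based on the discussion; the point is that TableId and Schema are the BigQuery client's own types, which is what ties this config to BigQuery and keeps it apart from the generic JDBC write config.

```java
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.TableId;

/**
 * Sketch of a BigQuery-specific write config (hypothetical fields).
 *
 * @param datasetId the dataset ID is equivalent to output schema
 */
public record BigQueryWriteConfig(
    String streamName,
    String namespace,
    String datasetId,
    TableId targetTableId,  // BigQuery client type, not a JDBC identifier
    TableId tmpTableId,
    Schema tableSchema,
    String syncMode) {}
```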
/publish connector=connectors/destination-bigquery
/publish connector=connectors/destination-bigquery-denormalized
What
Use a serialized buffer for GCS staging in the BigQuery destination.
How
The new implementation is built on the BufferedStreamConsumer, and reuses some of the BigQuery uploader, BigQueryUtils, and BigQueryRecordFormatter logic. A BigQueryStagingOperations interface is introduced to handle the staging operations involving the BigQuery client. It is similar to StagingOperations, but since BigQuery requires lots of unique logic, a new interface is necessary. The implementation wraps the behaviors from the BigQuery uploaders.
Recommended reading order
BigQueryDestination.java
BigQueryStagingConsumerFactory.java
BigQueryStagingOperations.java
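To give a sense of what the new interface covers, here is a sketch mirroring the generic StagingOperations but expressed in BigQuery terms (datasets and TableIds instead of JDBC schemas and tables). Method names are illustrative; see BigQueryStagingOperations.java for the real signatures.

```java
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.TableId;

// Sketch of the staging operations the BigQuery client has to support.
public interface BigQueryStagingOperations {

  // Create the target dataset (the "schema") if it does not exist yet.
  void createSchemaIfNotExists(String datasetId, String datasetLocation);

  // Create the (tmp) table the staged data will be loaded into.
  void createTableIfNotExists(TableId tableId, Schema tableSchema);

  // Upload a serialized buffer of records to the GCS stage; returns the staged file path.
  String uploadRecordsToStage(String datasetId, String streamName, byte[] serializedBuffer);

  // Bulk-load a staged file into the target table via a BigQuery load job.
  void copyIntoTargetTable(String datasetId, String stagedFilePath, TableId tableId, Schema tableSchema);

  // Delete staged files once they have been loaded (and on cleanup).
  void cleanUpStage(String datasetId, String stagedFilePath);
}
```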
🚨 User Impact 🚨
None expected.
TODO
Pre-merge Checklist
Updating a connector
Community member or Airbyter
- Secrets in the connector's spec are annotated with airbyte_secret
- Unit & integration tests added and passing: ./gradlew :airbyte-integrations:connectors:<name>:integrationTest
- Connector's README.md updated
- Connector's bootstrap.md updated. See description and examples
- Changelog updated in docs/integrations/<source or destination>/<name>.md including changelog. See changelog example
Airbyter
If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.
- The /test connector=connectors/<name> command is passing
- New connector version released by running the /publish command described here