Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove json avro schema converter hack #9232

Merged
merged 4 commits into from
Dec 31, 2021

Conversation

tuliren
Copy link
Contributor

@tuliren tuliren commented Dec 31, 2021

What

  • In BigQuery/BiqQuery denorm Destinations : Add possibility to use different types of GCS files #8788, a hack is introduced to the Json Avro schema converter.
  • The BigQuery uploader converts the Json schema to both Avro and BigQuery schema.
    • The Avro schema is used to convert Json objects to Avro objects.
    • The BigQuery schema is used to create the BigQuery table.
  • The correct process should be:
    1. Json schema -> Avro schema
    2. Json schema -> BigQuery schema
  • The BigQuery uploader incorrectly uses the BigQuery schema as the input for 1.
    • So it is Json schema -> BigQuery schema -> Avro schema
  • This is problematic because BigQuery schema is actually similar to the Avro schema.
    • The date-time format in the Json schema is changed to timestamp-micros.
  • But the Json to Avro schema converter expects a pure Json schema.
  • Consequently, the Json to Avro schema converter needs to handle the Avro format timestamp-micros, which should not be necessarily.

How

  • This PR fixes this hack by using the correct Json schema as the input to the Json to Avro schema converter.

Review Order

  • GcsBigQueryDenormalizedRecordFormatter.java
  • JsonToAvroSchemaConverter.java.
  • All other changes are automatic changes that add finals by the IDE.

Checkline

  • This PR does not change any functionality. No connector publishing is needed.

@github-actions github-actions bot added the area/connectors Connector related issues label Dec 31, 2021
* https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro#logical_types Replace
* date-time by timestamp
*/
textJson = textJson.replace("\"format\":\"date-time\"", "\"format\":\"timestamp-micros\"");
Copy link
Contributor Author

@tuliren tuliren Dec 31, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is moved to getBigQuerySchema because this is only needed in the BigQuery schema.

@@ -201,7 +201,6 @@ Schema getSingleFieldType(final String fieldName,
if (fieldDefinition.has("format")) {
final String format = fieldDefinition.get("format").asText();
fieldSchema = switch (format) {
case "timestamp-micros" -> LogicalTypes.timestampMicros().addToSchema(Schema.create(Schema.Type.LONG));
Copy link
Contributor Author

@tuliren tuliren Dec 31, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the hack that should not be added in the first place. Json schema does not have such format. This is an Avro format.

@tuliren tuliren requested a review from sherifnada December 31, 2021 02:16
@tuliren tuliren temporarily deployed to more-secrets December 31, 2021 02:18 Inactive
@tuliren tuliren requested a review from DoNotPanicUA December 31, 2021 02:23
@tuliren tuliren temporarily deployed to more-secrets December 31, 2021 02:24 Inactive
@tuliren
Copy link
Contributor Author

tuliren commented Dec 31, 2021

/test connector=connectors/destination-bigquery

🕑 connectors/destination-bigquery https://github.com/airbytehq/airbyte/actions/runs/1639692275
✅ connectors/destination-bigquery https://github.com/airbytehq/airbyte/actions/runs/1639692275
Python tests coverage:

	 ---------- coverage: platform linux, python 3.8.10-final-0 -----------
	 Name                                                              Stmts   Miss  Cover
	 -------------------------------------------------------------------------------------
	 main_dev_transform_catalog.py                                         3      3     0%
	 main_dev_transform_config.py                                          3      3     0%
	 normalization/__init__.py                                             4      0   100%
	 normalization/destination_type.py                                    13      0   100%
	 normalization/transform_catalog/__init__.py                           2      0   100%
	 normalization/transform_catalog/catalog_processor.py                143     77    46%
	 normalization/transform_catalog/destination_name_transformer.py     124      6    95%
	 normalization/transform_catalog/reserved_keywords.py                 13      0   100%
	 normalization/transform_catalog/stream_processor.py                 494    313    37%
	 normalization/transform_catalog/table_name_registry.py              174     34    80%
	 normalization/transform_catalog/transform.py                         45     26    42%
	 normalization/transform_catalog/utils.py                             33      7    79%
	 normalization/transform_config/__init__.py                            2      0   100%
	 normalization/transform_config/transform.py                         146     32    78%
	 -------------------------------------------------------------------------------------
	 TOTAL                                                              1199    501    58%

@tuliren
Copy link
Contributor Author

tuliren commented Dec 31, 2021

/test connector=connectors/destination-bigquery-denormalized

🕑 connectors/destination-bigquery-denormalized https://github.com/airbytehq/airbyte/actions/runs/1639692402
❌ connectors/destination-bigquery-denormalized https://github.com/airbytehq/airbyte/actions/runs/1639692402
🐛

@jrhizor jrhizor temporarily deployed to more-secrets December 31, 2021 04:14 Inactive
@jrhizor jrhizor temporarily deployed to more-secrets December 31, 2021 04:14 Inactive
@tuliren
Copy link
Contributor Author

tuliren commented Dec 31, 2021

/test connector=connectors/destination-bigquery

🕑 connectors/destination-bigquery https://github.com/airbytehq/airbyte/actions/runs/1640030146
✅ connectors/destination-bigquery https://github.com/airbytehq/airbyte/actions/runs/1640030146
Python tests coverage:

	 ---------- coverage: platform linux, python 3.8.10-final-0 -----------
	 Name                                                              Stmts   Miss  Cover
	 -------------------------------------------------------------------------------------
	 main_dev_transform_catalog.py                                         3      3     0%
	 main_dev_transform_config.py                                          3      3     0%
	 normalization/__init__.py                                             4      0   100%
	 normalization/destination_type.py                                    13      0   100%
	 normalization/transform_catalog/__init__.py                           2      0   100%
	 normalization/transform_catalog/catalog_processor.py                143     77    46%
	 normalization/transform_catalog/destination_name_transformer.py     124      6    95%
	 normalization/transform_catalog/reserved_keywords.py                 13      0   100%
	 normalization/transform_catalog/stream_processor.py                 494    313    37%
	 normalization/transform_catalog/table_name_registry.py              174     34    80%
	 normalization/transform_catalog/transform.py                         45     26    42%
	 normalization/transform_catalog/utils.py                             33      7    79%
	 normalization/transform_config/__init__.py                            2      0   100%
	 normalization/transform_config/transform.py                         146     32    78%
	 -------------------------------------------------------------------------------------
	 TOTAL                                                              1199    501    58%

@tuliren
Copy link
Contributor Author

tuliren commented Dec 31, 2021

/test connector=connectors/destination-bigquery-denormalized

🕑 connectors/destination-bigquery-denormalized https://github.com/airbytehq/airbyte/actions/runs/1640030503
✅ connectors/destination-bigquery-denormalized https://github.com/airbytehq/airbyte/actions/runs/1640030503
No Python unittests run

@tuliren tuliren temporarily deployed to more-secrets December 31, 2021 07:07 Inactive
@jrhizor jrhizor temporarily deployed to more-secrets December 31, 2021 07:08 Inactive
@jrhizor jrhizor temporarily deployed to more-secrets December 31, 2021 07:08 Inactive
@tuliren tuliren merged commit e3b727b into master Dec 31, 2021
@tuliren tuliren deleted the liren/separate-json-avro-converter branch December 31, 2021 07:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

S3 Destination : Json to Avro converter should change format of date-time to timestamp-micros
3 participants