Normalization handles quote in column names #5027

Merged 6 commits on Jul 28, 2021

@ChristopheDuong ChristopheDuong commented Jul 27, 2021

What

This PR depends on changes made to the BigQuery and MySQL destinations in #5026

Closes #4729

To reduce noise in this diff, the regenerated test output files are in a separate PR, #5028

How

Parse the raw JSON blob, escaping or replacing quote characters in column names according to each destination's rules

Recommended reading order

  1. x.java
  2. y.python

Pre-merge Checklist

Expand the checklist which is relevant for this PR.

Connector checklist

  • Issue acceptance criteria met
  • PR name follows PR naming conventions
  • Secrets are annotated with airbyte_secret in the connector's spec
  • Credentials added to GitHub CI if needed and not already present (see the instructions for injecting secrets into CI).
  • Unit & integration tests added as appropriate (and are passing)
    • Community members: please provide proof of this succeeding locally e.g: screenshot or copy-paste acceptance test output. To run acceptance tests for a Python connector, follow instructions in the README. For java connectors run ./gradlew :airbyte-integrations:connectors:<name>:integrationTest.
  • /test connector=connectors/<name> command as documented here is passing.
    • Community members can skip this, Airbyters will run this for you.
  • Code reviews completed
  • Documentation updated
    • README.md
    • docs/SUMMARY.md if it's a new connector
    • Created or updated reference docs in docs/integrations/<source or destination>/<name>.
    • Changelog in the appropriate page in docs/integrations/.... See changelog example
    • docs/integrations/README.md contains a reference to the new connector
    • Build status added to build page
  • Build is successful
  • Connector version bumped as described here
  • New Connector version released on Dockerhub by running the /publish command described here
  • No major blockers
  • PR merged into master branch
  • Follow up tickets have been created
  • Associated tickets have been closed & stakeholders notified

Connector Generator checklist

  • Issue acceptance criteria met
  • PR name follows PR naming conventions
  • If adding a new generator, add it to the list of scaffold modules being tested
  • The generator test modules (all connectors with -scaffold in their name) have been updated with the latest scaffold by running ./gradlew :airbyte-integrations:connector-templates:generator:testScaffoldTemplates then checking in your changes
  • Documentation which references the generator is updated as needed.

@@ -8,6 +8,7 @@ select
_airbyte_nested_stream_with_complex_columns_resulting_into_long_names_hashid,
json_extract_array(`partition`, "$['double_array_data']") as double_array_data,
json_extract_array(`partition`, "$['DATA']") as DATA,
json_extract_array(`partition`, "$['column___with__quotes']") as column___with__quotes,

@ChristopheDuong ChristopheDuong Jul 27, 2021

Thanks to PR #5026, the raw JSON blob in BigQuery also contains sanitized column names, in which quote characters have already been replaced by _, so normalization can extract from those keys.
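
A minimal sketch of the sanitization this comment relies on, assuming that backtick, single quote, and double quote are each replaced by one underscore. That set of characters is inferred from the expected output in the hunk above, and the helper names are hypothetical, not taken from the PR:

```python
# Hypothetical helper names; the set of replaced characters is an assumption
# inferred from the expected BigQuery output shown in the diff.
QUOTE_CHARS = "`'\""


def sanitize_quotes(name: str) -> str:
    """Replace each quote character in a column name with an underscore."""
    return "".join("_" if ch in QUOTE_CHARS else ch for ch in name)


def bigquery_json_path(name: str) -> str:
    """Build a bracket-notation JSON path that targets the sanitized key."""
    return "$['" + sanitize_quotes(name) + "']"


original = "column`_'with\"_quotes"
print(sanitize_quotes(original))     # column___with__quotes
print(bigquery_json_path(original))  # $['column___with__quotes']
```

Because the key is already free of quote characters, the path needs no escaping at all.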

'$."DATA"') as `DATA`,
'$."DATA"') as `DATA`,
json_extract(`partition`,
'$."column___with__quotes"') as `column__'with"_quotes`,

@ChristopheDuong ChristopheDuong Jul 27, 2021

Thanks to PR #5026, the raw JSON blob in MySQL also contains sanitized column names, in which quote characters have already been replaced by _, so normalization can extract from those keys.

Note that json_extract can't handle quotes in the JSON path, but MySQL does allow column names that contain ' or ".
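
The asymmetry described here can be sketched as follows: the JSON path uses the fully sanitized key, while the output alias keeps ' and " and only replaces the backtick, since the alias itself is backtick-quoted. This mirrors the expected output in the hunk above; the function names are hypothetical and the JSON column name is assumed to contain no backticks:

```python
# Hypothetical helpers mirroring the MySQL output in the diff; not the PR's code.
QUOTE_CHARS = "`'\""


def sanitize_quotes(name: str) -> str:
    """Replace every quote character with an underscore (key in the raw blob)."""
    return "".join("_" if ch in QUOTE_CHARS else ch for ch in name)


def mysql_json_path(name: str) -> str:
    """JSON path on the sanitized key, so the path itself contains no quotes."""
    return '$."' + sanitize_quotes(name) + '"'


def mysql_alias(name: str) -> str:
    """Backtick-quoted alias: ' and " are kept, only the backtick is replaced."""
    return "`" + name.replace("`", "_") + "`"


def mysql_extract(json_column: str, name: str) -> str:
    """Assumes json_column is a plain identifier without backticks."""
    return f"json_extract(`{json_column}`, '{mysql_json_path(name)}') as {mysql_alias(name)}"


print(mysql_extract("partition", "column`_'with\"_quotes"))
# json_extract(`partition`, '$."column___with__quotes"') as `column__'with"_quotes`
```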

@github-actions (bot) added the area/documentation and area/worker labels on Jul 27, 2021
@@ -6,6 +6,7 @@ select
_airbyte_nested_stre__nto_long_names_hashid,
jsonb_extract_path("partition", 'double_array_data') as double_array_data,
jsonb_extract_path("partition", 'DATA') as "DATA",
jsonb_extract_path("partition", 'column`_''with"_quotes') as "column`_'with""_quotes",

Postgres parses the JSON blob just fine. The only edge case is that ' characters must be doubled to escape them inside the string literal.
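
A minimal sketch of that escaping, using standard SQL quoting rules: single quotes are doubled inside a string literal and double quotes are doubled inside a quoted identifier. The function names are hypothetical and this is not the PR's actual implementation:

```python
# Hypothetical helpers; they reproduce the Postgres line shown in the diff.
def pg_string_literal(value: str) -> str:
    """Quote a value as a SQL string literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def pg_identifier(name: str) -> str:
    """Quote a name as a SQL identifier, doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def pg_extract(json_column: str, name: str) -> str:
    """Extract the original (unsanitized) key and alias it to the same name."""
    return (
        f"jsonb_extract_path({pg_identifier(json_column)}, "
        f"{pg_string_literal(name)}) as {pg_identifier(name)}"
    )


print(pg_extract("partition", "column`_'with\"_quotes"))
# jsonb_extract_path("partition", 'column`_''with"_quotes') as "column`_'with""_quotes"
```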

@@ -6,6 +6,7 @@ select
_AIRBYTE_NESTED_STREAM_WITH_COMPLEX_COLUMNS_RESULTING_INTO_LONG_NAMES_HASHID,
get_path(parse_json(PARTITION), '"double_array_data"') as DOUBLE_ARRAY_DATA,
get_path(parse_json(PARTITION), '"DATA"') as DATA,
get_path(parse_json(PARTITION), '"column`_''with""_quotes"') as "column`_'with""_quotes",

Snowflake parses the JSON blob just fine. The only edge case is that ' and " characters must be doubled to escape them.
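
The two levels of doubling can be sketched like this: the path element is wrapped in double quotes with embedded " doubled, and the result is then wrapped as a SQL string literal with embedded ' doubled. The behaviour is modelled on the expected output in the hunk above rather than on Snowflake's documentation, and the function names are hypothetical:

```python
# Hypothetical helpers; they reproduce the Snowflake line shown in the diff.
def snowflake_identifier(name: str) -> str:
    """Double-quoted identifier with embedded double quotes doubled."""
    return '"' + name.replace('"', '""') + '"'


def snowflake_path_literal(name: str) -> str:
    """Path element in double quotes, then wrapped as a SQL string literal."""
    element = snowflake_identifier(name)
    return "'" + element.replace("'", "''") + "'"


def snowflake_extract(json_column: str, name: str) -> str:
    """Assumes json_column is a plain, unquoted identifier such as PARTITION."""
    return (
        f"get_path(parse_json({json_column}), "
        f"{snowflake_path_literal(name)}) as {snowflake_identifier(name)}"
    )


print(snowflake_extract("PARTITION", "column`_'with\"_quotes"))
# get_path(parse_json(PARTITION), '"column`_''with""_quotes"') as "column`_'with""_quotes"
```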

Base automatically changed from chris/handle-quote-destinations to master July 28, 2021 12:38
@ChristopheDuong ChristopheDuong merged commit d6429a4 into master Jul 28, 2021
@ChristopheDuong ChristopheDuong deleted the chris/handle-quote-normalization branch July 28, 2021 14:00
Labels
area/documentation, area/worker, normalization
Development

Successfully merging this pull request may close these issues.

[MSSQL -> Snowflake] Normalization issue with ' in columns names