Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

0x00 never supported in Postgres, automatic stripping #1216

Open
FreCap opened this issue Jul 9, 2022 · 13 comments
Open

0x00 never supported in Postgres, automatic stripping #1216

FreCap opened this issue Jul 9, 2022 · 13 comments

Comments

@FreCap
Copy link

FreCap commented Jul 9, 2022

When copying over data from MySQL to Postgres I hit the exception:

  Where: unnamed portal parameter $19  Call getNextException to see other errors in the batch.
org.postgresql.util.PSQLException: ERROR: invalid byte sequence for encoding "UTF8": 0x00

According to: https://stackoverflow.com/a/1348551, 0x00 is never a supported character

PostgreSQL doesn't support storing NULL (\0x00) characters in text fields (this is obviously different from the database NULL value, which is fully supported).

Source: http://www.postgresql.org/docs/9.1/static/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS-UESCAPE

Question
Since we cannot change the source data (being a downstream only service), what could be the best strategy to remove the 0x00 characters?
How could/Should this be baked into JDBCSinkConnector?

@k1th
Copy link
Contributor

k1th commented Sep 9, 2022

is anyone interested in this change? We could do this in io.confluent.connect.jdbc.dialect.PostgreSqlDatabaseDialect#maybeBindPrimitive
for example, adding a version of

      case STRING:
        statement.setString(index, (String) value);
        break;

that replaces \u0000 chars with nothing…

@oleksandrkovalenko
Copy link

I would like to work on this one. I will try to come with PR during the weekend.

@oleksandrkovalenko
Copy link

oleksandrkovalenko commented Nov 4, 2022

I created a PR: #1264

@lopezco
Copy link

lopezco commented Dec 15, 2022

@oleksandrkovalenko any news on the PR #1264 being accepted?
or a workaround?

@oleksandrkovalenko
Copy link

I don't know anything more than PR itself, which is silence. Maybe we could tag somebody, but I don't know who it could be.

Our workaround if you are interested: We had only one such case and have manually moved the offset in Kafka to unblock processing. Also, we requested changes in the source MySQL database to get new records in Kafka without \0x00.

As we are running Kafka connect using our custom image, we are considering replacing the original connector from maven-central with our custom-built connector if the case occurs again. By the way, I just realised that I didn't try to build a jar from the sources. All in all not a blocker for us, just learning about \0x00 differences between Postgres and MySQL.

@rohitkou
Copy link

rohitkou commented Mar 17, 2023

Thank you, @oleksandrkovalenko,
I encountered a similar problem while configuring a data pipeline between two Postgres servers. The error message was 'invalid byte sequence for encoding "UTF8": 0x00. Where: unnamed portal parameter $1 Call getNextException to see other errors in the batch.'

Your solution resolved the issue for me. It's unclear why this problem occurred in our case and not for others postgres users.

@cyberjar09
Copy link

im considering just writing an SMT which removes the character, what do you'll think?

p.s. even if someone was to review and approve, there appears to be a jenkins CI failure.

@oleksandrkovalenko
Copy link

I know, but there are no errors logged, so I am not sure what should be fixed. Any thought? @cyberjar09

@cyberjar09
Copy link

sorry @oleksandrkovalenko have not checked on whats the failure but FWIW, I wrote the SMT which fixes my problem, if you are interested here is the link: https://github.com/cyberjar09/StripUnicodeNullTransform

@roadSurfer
Copy link

roadSurfer commented Apr 27, 2023

I've written an SMT that can replace the \u0000 with another configured character (such as \ufffd) or nothing.
It's currently proprietary, but if I can commit the code back/release it I will.

@mbauer90
Copy link

mbauer90 commented Feb 7, 2024

Hi, @cyberjar09

This https://github.com/cyberjar09/StripUnicodeNullTransform using in your SMT sink?

@roadSurfer your have the code and example?

@cyberjar09
Copy link

Hi, @cyberjar09

This https://github.com/cyberjar09/StripUnicodeNullTransform using in your SMT sink?

@roadSurfer your have the code and example?

I was using it, but we moved away from debezium as a solution so ive decommissioned it since

@roadSurfer
Copy link

Yes, I have an updated example here: https://github.com/roadSurfer/StripUnicodeNullTransform

It's little more than a clone of @cyberjar09's with some extra flexibility and changes based on SonarLint reports.

We actually ditched the SMT in favour of have the Processors strip the null chars as they went through the incoming messages. This seemed to have less of a performance impact (for us, with our data, on our hardware; you would need to do your own testing).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants