Consumer offset resets to BEGINNING on SSL errors on 2.1.0 and 2.1.1 versions #1577

Closed
5 of 7 tasks
robsys opened this issue Jun 1, 2023 · 5 comments

robsys commented Jun 1, 2023

Description

On one of our consumers, we get an SSL_HANDSHAKE error that triggers "Partition log truncation detected", after which the offset is suddenly reset to BEGINNING. This, of course, causes huge consumer lag. The issue seems to trace back to this change in https://github.com/confluentinc/librdkafka/releases/tag/v2.1.0:

KIP-320
Allow fetchers to detect and handle log truncation (confluentinc/librdkafka#4122).

We tried bumping to 2.1.1, but it does not seem to include a fix. Our other consumers run on 1.9.2 and don't have this issue.

Logs:

INFO:  %3|1685564449.205|FAIL|...-...-client#consumer-1| [thrd:sasl_ssl://<url>/1]: sasl_ssl://<url>/14: SSL handshake failed: error:0A000126:SSL routines::unexpected eof while reading (after 5003ms in state SSL_HANDSHAKE)
INFO:  %4|1685564449.205|OFFSET|...-...-client#consumer-1| [thrd:main]: <topic> [4]: offset reset (at offset INVALID (leader epoch 52), broker 14) to offset BEGINNING (leader epoch -1): Unable to validate offset and epoch: Local: SSL error: Local: Partition log truncation detected
INFO:  %3|1685564454.389|FAIL|...-...-client#consumer-1| [thrd:sasl_ssl://<url>/1]: sasl_ssl://<url>/14: SSL handshake failed: error:0A000126:SSL routines::unexpected eof while reading (after 5002ms in state SSL_HANDSHAKE, 1 identical error(s) suppressed)
INFO:  %4|1685564454.389|OFFSET|...-...-client#consumer-1| [thrd:main]: <topic> [4]: offset reset (at offset BEGINNING (leader epoch -1), broker 14) to offset BEGINNING (leader epoch -1): failed to query logical offset: Local: SSL error
ERROR: Error reading message received at topic <topic>: Failed to query logical offset BEGINNING: Local: SSL error
EXCEPTION: KafkaError{code=_SSL,val=-181,str="Failed to query logical offset BEGINNING: Local: SSL error"}
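
For anyone who wants to alert on this pattern, here is a minimal sketch that picks the `OFFSET` reset lines out of client logs shaped like the ones above. The regular expression is inferred only from these sample lines, and the helper names (`parse_offset_reset`, `is_suspicious_reset`) are hypothetical, not part of confluent-kafka-python or librdkafka; other librdkafka versions may word the message differently.

```python
import re

# Pattern inferred from the sample log lines in this issue (an assumption,
# not a documented librdkafka log format).
OFFSET_RESET_RE = re.compile(
    r"\[(?P<partition>\d+)\]: offset reset "
    r"\(at offset (?P<from_offset>\S+) \(leader epoch (?P<from_epoch>-?\d+)\), "
    r"broker (?P<broker>\d+)\) "
    r"to offset (?P<to_offset>\S+) \(leader epoch (?P<to_epoch>-?\d+)\): "
    r"(?P<reason>.*)$"
)


def parse_offset_reset(line):
    """Return a dict describing an offset reset log line, or None if no match."""
    match = OFFSET_RESET_RE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    # Convert the numeric fields so callers can compare them directly.
    for key in ("partition", "broker", "from_epoch", "to_epoch"):
        info[key] = int(info[key])
    return info


def is_suspicious_reset(info):
    """True when a reset to BEGINNING is attributed to an SSL error."""
    return info["to_offset"] == "BEGINNING" and "SSL error" in info["reason"]
```

Feeding each log line through `parse_offset_reset` and alerting when `is_suspicious_reset` returns true would flag both `OFFSET` lines shown in the logs above, while the `FAIL` lines are ignored.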

How to reproduce

I wasn't able to reproduce it locally.

Checklist

Please provide the following information:

  • confluent-kafka-python and librdkafka version: (2.1.0 / 2.1.0) and (2.1.1 / 2.1.1)
  • Apache Kafka broker version:
  • Client configuration:
{
    "bootstrap.servers": "...",
    "group.id": "...",
    "client.id": "...",
    "session.timeout.ms": 6000,
    "auto.offset.reset": "earliest",
    "max.poll.interval.ms": 600000,
}
  • Operating system: debian:11.3-slim
  • Provide client logs (with 'debug': '..' as necessary)
  • Provide broker log excerpts
  • Critical issue

emasab (Contributor) commented Jun 1, 2023

Hi @robsys. We're aware of the issue; it's tracked in librdkafka as confluentinc/librdkafka#4293.
There's a PR to fix it: confluentinc/librdkafka#4294.

For now, you can revert to 2.0.2, which doesn't have offset validation, so you won't get offset resets (though that error could still cause problems in other places), or revert to 1.9.2, which doesn't have OpenSSL 3.

Next release is planned for the end of June.
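
While pinned to an older version, a startup guard can catch an accidental upgrade back into the affected range. The sketch below assumes the affected set is exactly the two versions named in this issue's title (2.1.0 and 2.1.1); `AFFECTED_LIBRDKAFKA`, `parse_version` and `is_affected` are hypothetical names, and the parser assumes a plain `major.minor.patch` string with an optional `-suffix`.

```python
# Versions reported as affected in this issue (an assumption taken from the
# issue title, not an official list).
AFFECTED_LIBRDKAFKA = {(2, 1, 0), (2, 1, 1)}


def parse_version(version):
    """Turn a string such as "2.1.1" or "2.1.1-RC1" into a tuple of ints."""
    core = version.split("-", 1)[0]  # drop any pre-release suffix
    return tuple(int(part) for part in core.split(".")[:3])


def is_affected(version):
    """True if the given librdkafka version string is in the affected set."""
    return parse_version(version) in AFFECTED_LIBRDKAFKA


if __name__ == "__main__":
    try:
        import confluent_kafka
    except ImportError:
        print("confluent_kafka is not installed; nothing to check")
    else:
        # libversion() returns a (version string, version int) tuple.
        lib_version, _ = confluent_kafka.libversion()
        if is_affected(lib_version):
            raise SystemExit(
                "librdkafka %s is affected by offset resets on SSL errors" % lib_version
            )
        print("librdkafka %s is outside the affected range" % lib_version)
```

Failing fast at startup is a design choice here; logging a warning instead may fit better where an abrupt exit is not acceptable.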

robsys (Author) commented Jun 1, 2023

Thank you. It first occurred a couple of weeks ago and then again very recently, and I didn't check all the new issues. Glad to know it's already a known issue. We have downgraded to 1.9.2 for the time being. Should we keep this open until it's fixed?

emasab (Contributor) commented Jun 1, 2023

Let's keep it open to make it visible in this repo too. We'll close it once it's merged.

@ElliotJBall

Hi there @emasab, can I confirm that, following the release of librdkafka 2.2.0, this issue has been resolved as of version 2.2.0 of this library? Can this issue now be marked as closed?

We also ran into this in our production environment and have since pinned to an earlier version. I am now looking at moving to the latest version and want to confirm the above.

helloHKTK commented Nov 9, 2023 via email
