Fix integration test action #223
Conversation
As far as I know, Redpanda does not yet support transactions. I encountered the same error myself and read their docs.
I thought it does if you use
I thought it was that the community edition did not support it; I may be mistaken.
Ahhh: from their docs:
No, it isn't. Redpanda transactions don't cover compacted topics, also we don't support KIP-447 and KIP-360 yet. Overall transactional support is on the Kafka 2.4 level. But we work on catching up.
Oh, I see. Thank you for digging that up! ❤️ Dunno if the failures I'm seeing are related (maybe @twmb knows), but, hopefully, they'll catch up soon :)
I thought I tested against Redpanda above, but I had a stray Kafka broker alive which allowed the tests to be successful. Redpanda's transactions are a bit WIP right now and should be fixed in the next release. I'll test against it, but this repo should stick to Kafka for now.
OK, makes sense to stick with Kafka for now and maybe add Redpanda as an extra test in the future. Not sure what to make of the failure... It worked great for me: https://github.com/mihaitodor/franz-go/actions/runs/3257547284/jobs/5348876552. Hope it's not a heisenbug. LE: I just ran it again on another branch in my fork and it passed: https://github.com/mihaitodor/franz-go/actions/runs/3261604515/jobs/5356889078 LE2: Definitely a heisenbug 😕 I ran it again here. Guess this log entry is relevant:
I wonder if it might be due to Kafka taking too long to start... I didn't add any sort of health check on the container, since it looks like the tests try to reconnect a bunch of times before giving up. I'm surprised it didn't reproduce on localhost. I might try it again a few more times to see.
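For illustration, a readiness wait on the test side could look like the sketch below. This is a hedged example: it assumes a single broker advertised on 127.0.0.1:9092 as in the compose file further down, and the helper name is made up for this thread, not code from the PR.

```go
package kafkatest

import (
	"context"
	"fmt"
	"time"

	"github.com/twmb/franz-go/pkg/kgo"
)

// waitForKafka pings the broker until it responds or ctx expires, so tests
// don't start producing before the container has finished booting.
func waitForKafka(ctx context.Context) error {
	cl, err := kgo.NewClient(kgo.SeedBrokers("127.0.0.1:9092")) // assumed seed address
	if err != nil {
		return err
	}
	defer cl.Close()
	for {
		if err := cl.Ping(ctx); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("kafka did not become ready: %w", ctx.Err())
		case <-time.After(time.Second): // retry once per second
		}
	}
}
```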
I'll see if I can recreate this locally -- a sequence number reset can happen in a few scenarios, but they shouldn't be happening in this test.
Since Kafka is running in Zookeeperless mode, I wonder if it might have some rough corners. Hopefully not.
It looks like Kafka isn't respecting its 5-wide idempotency window. I'm checking to see if something in Bitnami's image specifically changed it, or what. From reproducing locally:
I think this is bringing up an issue that has always been possible in Kafka, but is now visible with whatever changes they're making in KRaft. Apparently, the first produce request's sequence number does not need to start at 0. That's problematic:
I'll check the Kafka source to see if it truly does accept a non-zero sequence number for the first request. It also seems that KRaft keeps replying NOT_LEADER_FOR_PARTITION for a little while, and quite a bit more frequently than before.
Very interesting, thanks for sharing that insight! Out of curiosity, I also tried to see if I can use Confluent's Kafka container in Zookeeperless mode, since they offer a setup example here: https://github.com/confluentinc/cp-all-in-one/blob/7.2.1-post/cp-all-in-one-kraft/docker-compose.yml, but I guess I'm doing something wrong with the way I'm volume mounting that silly shell script: https://github.com/mihaitodor/franz-go/actions/runs/3269542233/jobs/5377158865. GitHub has some bugs around volume mounting in service containers and getting around them is very frustrating... Anyway, I guess it will behave the same as the Bitnami one, based on your investigation above.
For context (creating a Kafka issue now), this is my docker compose:

```yaml
version: "3.7"
services:
  kafka:
    image: bitnami/kafka:latest
    network_mode: host
    environment:
      KAFKA_ENABLE_KRAFT: yes
      KAFKA_CFG_PROCESS_ROLES: controller,broker
      KAFKA_CFG_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_CFG_LISTENERS: PLAINTEXT://:9092,CONTROLLER://:9093
      KAFKA_CFG_LISTENER_SECURITY_PROTOCOL_MAP: CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT
      KAFKA_CFG_CONTROLLER_QUORUM_VOTERS: 1@127.0.0.1:9093
      # Set this to "PLAINTEXT://127.0.0.1:9092" if you want to run this container on localhost via Docker
      KAFKA_CFG_ADVERTISED_LISTENERS: PLAINTEXT://127.0.0.1:9092
      KAFKA_CFG_BROKER_ID: 1
      ALLOW_PLAINTEXT_LISTENER: yes
      KAFKA_KRAFT_CLUSTER_ID: XkpGZQ27R3eTl3OdTm2LYA # 16 byte base64-encoded UUID
      BITNAMI_DEBUG: true # Enable this to get more info on startup failures
```

How I am running this:
and then
https://issues.apache.org/jira/browse/KAFKA-14312
AFAICT, there is no solution to 14312. One approach I could try is to sleep 1s after creating topics in the test.
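As a rough sketch of that workaround (assuming topic creation goes through franz-go's kadm package; the helper below is hypothetical and not the repo's actual test code):

```go
package kafkatest

import (
	"context"
	"time"

	"github.com/twmb/franz-go/pkg/kadm"
	"github.com/twmb/franz-go/pkg/kgo"
)

// createTopicThenSettle creates a single-partition topic and then sleeps 1s
// before returning, giving the KRaft controller a moment to settle before
// the tests start producing to the new topic.
func createTopicThenSettle(ctx context.Context, cl *kgo.Client, topic string) error {
	adm := kadm.NewClient(cl)
	if _, err := adm.CreateTopics(ctx, 1, 1, nil, topic); err != nil {
		return err
	}
	time.Sleep(time.Second) // crude pause while KAFKA-14312 has no fix
	return nil
}
```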
Another odd issue:
Kafka took 34 seconds to finish a commit within itself. This is causing duplicates, because a rebalance happened in the middle and moved partitions. Technically, I should be using
Opened https://issues.apache.org/jira/browse/KAFKA-14315 to track this.
Single-broker localhost kraft testing shows this error being returned **much** more frequently. Rather than having a 5s timeout, we now continue until the input context is canceled. This largely affects ending a transaction. We also improve the locking situation in one area. It's a bit ugly, but the solution is better than unlocking directly at every return site. For #223.
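For readers following along, the change being described is roughly the pattern sketched below: keep retrying a retriable error until the caller's context is canceled, instead of bailing after a fixed 5 seconds. This is an illustrative sketch, not the actual franz-go internals.

```go
package kafkatest

import (
	"context"
	"time"
)

// retryUntilCanceled keeps invoking op while it returns a retriable error,
// stopping only when op succeeds, fails permanently, or ctx is canceled.
func retryUntilCanceled(ctx context.Context, op func() error, retriable func(error) bool) error {
	for {
		err := op()
		if err == nil || !retriable(err) {
			return err
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(250 * time.Millisecond): // brief pause between attempts
		}
	}
}
```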
I've added a bunch of commits in #224 that should address failures, and cherry-picked this commit. I've looped tests for a few hours and have not seen any failures locally. Closing this and merging the other PR. This also finally makes it worth it to enable branch protection (and require all commits to go through PRs), which is also nice because I've occasionally introduced a compile failure without testing (and immediately fixed it). Now that this is merged, I plan to address 214, 220, 221, 222 and then release 1.19. Hopefully the rest of this is quick.
That's really awesome, thank you for all this work @twmb!
For full posterity in case I need it later:

```bash
#!/bin/bash

docker compose down
sleep 1
docker compose up -d
until rpk topic list; do
  sleep 1
done

while go test -run Txn > logs; do
  echo "whoo"
  docker compose down
  sleep 1
  docker compose up -d
  until rpk topic list; do
    sleep 1
  done
done
```

Also, if you want to help with #225, I need to change it to expose a port ;)
Some bug fixes [went in](https://github.com/twmb/franz-go/blob/master/CHANGELOG.md#v190) following the PR I raised here: twmb/franz-go#223. Since Travis was kind enough to also fix the old ARM builds, I thought it should be fine to re-enable them by reverting 414cf1f.
Hey @twmb, following up on #218, I think this should do the trick using the `bitnami/kafka` container in Zookeeperless mode. I had to adjust the tests a bit so the `KGO_SEEDS` env var is read by each test which needs to connect to Kafka.

I also tried to use Redpanda, but there are some issues with it. Could you please let me know how you got the tests to pass against it? It may just be that I'm not configuring it correctly (maybe it doesn't like the fact that I'm running it as a single node?), but my Kafka knowledge is too limited to troubleshoot it. Here's what I tried: I pulled the current `latest` tag of the Redpanda container (`sha256:84c8cd98ed7a50e4f378af55600c92c4fb097afab4ec164c3ded9b0e884754c4`).

I guess this log entry might be relevant: `[ERROR] fatal InitProducerID error, failing all buffered records; broker: 1, err: INVALID_PRODUCER_EPOCH: Producer attempted an operation with an old epoch.`
Here are the full logs:
redpanda.log
test.log
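For illustration, the `KGO_SEEDS` wiring mentioned in the description could look roughly like the sketch below; the helper name and the fallback seed are assumptions, not the exact code in this PR.

```go
package kgo_test

import (
	"os"
	"strings"
	"testing"

	"github.com/twmb/franz-go/pkg/kgo"
)

// newTestClient builds a client from the KGO_SEEDS env var so tests can be
// pointed at whatever broker the CI workflow started.
func newTestClient(t *testing.T) *kgo.Client {
	t.Helper()
	seeds := os.Getenv("KGO_SEEDS")
	if seeds == "" {
		seeds = "127.0.0.1:9092" // assumed fallback for local runs
	}
	cl, err := kgo.NewClient(kgo.SeedBrokers(strings.Split(seeds, ",")...))
	if err != nil {
		t.Fatalf("unable to create client: %v", err)
	}
	t.Cleanup(cl.Close)
	return cl
}
```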