
Fix integration test action #223

Closed


mihaitodor
Contributor

Hey @twmb, following up on #218, I think this should do the trick using the bitnami/kafka container in Zookeeperless mode. I had to adjust the tests a bit so the KGO_SEEDS env var is read by each test which needs to connect to Kafka.
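Roughly, each test now builds its client from that env var (a simplified sketch, not the exact code in this PR; the helper name and the localhost fallback are illustrative):

```go
package kgo_test

import (
	"os"
	"strings"
	"testing"

	"github.com/twmb/franz-go/pkg/kgo"
)

// testSeeds reads KGO_SEEDS and falls back to localhost when it is unset.
// (Hypothetical helper name; the real tests may structure this differently.)
func testSeeds() []string {
	seeds := os.Getenv("KGO_SEEDS")
	if seeds == "" {
		seeds = "127.0.0.1:9092"
	}
	return strings.Split(seeds, ",")
}

func TestClientConnects(t *testing.T) {
	cl, err := kgo.NewClient(kgo.SeedBrokers(testSeeds()...))
	if err != nil {
		t.Fatal(err)
	}
	defer cl.Close()
}
```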

I also tried to use Redpanda, but there are some issues with it. Could you please let me know how you got the tests to pass against it? It may just be that I'm not configuring it correctly (maybe it doesn't like the fact that I'm running it as a single node?), but my Kafka knowledge is too limited to troubleshoot it. Here's what I tried:

> docker run --rm -p9092:9092 docker.vectorized.io/vectorized/redpanda:latest start --default-log-level=debug --smp 1 --reserve-memory 0M --overprovisioned --set redpanda.enable_transactions=true --kafka-addr 0.0.0.0:9092 --advertise-kafka-addr 127.0.0.1:9092
> # Wait for container to start
> KGO_TEST_RF=1 KGO_TEST_RECORDS=50000 go test -v -count 1 ./...

I pulled the current latest tag of the Redpanda container (sha256:84c8cd98ed7a50e4f378af55600c92c4fb097afab4ec164c3ded9b0e884754c4).

I guess this log entry might be relevant: [ERROR] fatal InitProducerID error, failing all buffered records; broker: 1, err: INVALID_PRODUCER_EPOCH: Producer attempted an operation with an old epoch.
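For context, a client configured for transactions issues InitProducerID up front, which is presumably where this surfaces. A minimal sketch of such a client (placeholder broker, topic, and transactional ID; not the actual test code):

```go
package main

import (
	"context"

	"github.com/twmb/franz-go/pkg/kgo"
)

func main() {
	// A transactional producer sends InitProducerID when the transaction
	// machinery starts, so broker-side transaction problems show up early.
	cl, err := kgo.NewClient(
		kgo.SeedBrokers("127.0.0.1:9092"),
		kgo.TransactionalID("example-txn-id"), // placeholder id
	)
	if err != nil {
		panic(err)
	}
	defer cl.Close()

	ctx := context.Background()
	if err := cl.BeginTransaction(); err != nil {
		panic(err)
	}
	rec := &kgo.Record{Topic: "example-topic", Value: []byte("hello")} // placeholder topic
	cl.Produce(ctx, rec, nil)
	if err := cl.Flush(ctx); err != nil {
		panic(err)
	}
	if err := cl.EndTransaction(ctx, kgo.TryCommit); err != nil {
		panic(err)
	}
}
```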

Here are the full logs:

redpanda.log
test.log

@ssorren

ssorren commented Oct 16, 2022

As far as I know, Redpanda does not yet support transactions. I encountered the same error myself and read their docs.

@mihaitodor
Contributor Author

I thought it does if you use --set redpanda.enable_transactions=true, but maybe the implementation isn't complete yet.

@ssorren

ssorren commented Oct 16, 2022

I thought it was the community edition that didn't support it, but I may be mistaken.

@ssorren

ssorren commented Oct 16, 2022

Ahhh: "Compacted topic doesn't support transactional processing" from the Redpanda logs. That's what's causing my issue. Not sure about the integration tests.

from their docs:

Is Redpanda transaction support complete?

No, it isn't. Redpanda transactions don't cover compacted topics, also we don't support KIP-447 and KIP-360 yet. Overall transactional support is on the Kafka 2.4 level. But we work on catching up.

@mihaitodor
Contributor Author

Oh, I see. Thank you for digging that up! ❤️ Dunno if the failures I'm seeing are related (maybe @twmb knows), but, hopefully, they'll catch up soon :)

@twmb
Owner

twmb commented Oct 16, 2022

I thought I tested against Redpanda above, but I had a stray Kafka broker running, which allowed the tests to pass. Redpanda's transactions are a bit WIP right now and should be fixed in the next release. I'll test against it, but this repo should stick to Kafka for now.

@mihaitodor
Contributor Author

mihaitodor commented Oct 17, 2022

OK, makes sense to stick with Kafka for now and maybe add Redpanda as an extra test in the future.

Not sure what to make of the failure... It worked great for me: https://github.com/mihaitodor/franz-go/actions/runs/3257547284/jobs/5348876552. Hope it's not a heisenbug.

LE: I just ran it again on another branch in my fork and it passed: https://github.com/mihaitodor/franz-go/actions/runs/3261604515/jobs/5356889078

LE2: Definitely a heisenbug 😕 I ran it again here. Guess this log entry is relevant:

2022-10-17T01:13:42.3034436Z  [2022-10-17 01:13:26,007] ERROR [ReplicaManager broker=1] Error processing append operation on partition 48ed629636ee415e86260963f079acbdd4d0e6f29f8d20e230715d8a5933f99e-19 (kafka.server.ReplicaManager)
2022-10-17T01:13:42.3035125Z  org.apache.kafka.common.errors.OutOfOrderSequenceException: Out of order sequence number for producer 46 at offset 808 in partition 48ed629636ee415e86260963f079acbdd4d0e6f29f8d20e230715d8a5933f99e-19: 0 (incoming seq. number), 2516 (current end sequence number)

I wonder if it might be due to Kafka taking too long to start... I didn't add any sort of health check on the container, since it looks like the tests try to reconnect a bunch of times before giving up. I'm surprised it didn't reproduce on localhost. I might try it again a few more times to see.
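If a health check does turn out to be necessary, one option would be a small readiness helper that pings the broker before the suite starts (just a sketch, not something this PR adds; the helper name is made up):

```go
package kgo_test

import (
	"context"
	"time"

	"github.com/twmb/franz-go/pkg/kgo"
)

// waitForKafka blocks until the broker answers a ping or the timeout expires.
// (Hypothetical helper; Ping issues a lightweight request to check reachability.)
func waitForKafka(seeds []string, timeout time.Duration) error {
	cl, err := kgo.NewClient(kgo.SeedBrokers(seeds...))
	if err != nil {
		return err
	}
	defer cl.Close()

	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	for {
		if err := cl.Ping(ctx); err == nil {
			return nil // broker is reachable
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Second):
		}
	}
}
```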

@twmb
Owner

twmb commented Oct 17, 2022

I'll see if I can recreate this locally -- a sequence number reset can happen in a few scenarios, but they shouldn't be happening in this test.

@mihaitodor
Contributor Author

Since Kafka is running in Zookeeperless mode, I wonder if it might have some rough edges. Hopefully not.

@twmb
Owner

twmb commented Oct 17, 2022

It looks like Kafka isn't respecting its 5-wide idempotency window. I'm checking to see if something in Bitnami's image specifically changed it, or what. From reproducing locally:

  • send partition 13 sequence # start 0
  • send partition 13 sequence # start 701
  • failure, retry the above two
  • receive OOOSN because Kafka is not de-duplicating the repeated messages

@twmb
Owner

twmb commented Oct 17, 2022

I think this is bringing up an issue that has always been possible in Kafka, but is now visible with whatever changes they're making in Kraft. Apparently, the first produce request's sequence number does not need to start at 0. That's problematic:

  • client creates topic foo, kafka replies "sure, created"
  • client sends produce request A with 5 messages, sends produce request B with 5 messages before a response for A is received (base sequence numbers 0 and 5)
  • request A is rejected with NOT_LEADER_FOR_PARTITION because the topic is not fully initialized, request B is accepted
  • Kafka now has 5 records locally, starting at a base sequence of 5 and it is expecting 10
  • client re-sends request A and B
  • request A is rejected as out of order, since it has base sequence 0

I'll check the Kafka source to see if it truly does accept a non-zero sequence number for the first request.

It seems that Kraft keeps replying NOT_LEADER_FOR_PARTITION for a bit after a topic is created, and quite a bit more frequently.

@twmb
Owner

twmb commented Oct 17, 2022

@mihaitodor
Contributor Author

Very interesting, thanks for sharing that insight!

Out of curiosity, I also tried to see if I could use Confluent's Kafka container in Zookeeperless mode, since they offer a setup example here: https://github.com/confluentinc/cp-all-in-one/blob/7.2.1-post/cp-all-in-one-kraft/docker-compose.yml, but I guess I'm doing something wrong with the way I'm volume mounting that silly shell script: https://github.com/mihaitodor/franz-go/actions/runs/3269542233/jobs/5377158865. GitHub has some bugs around volume mounting in service containers, and getting around them is very frustrating... Anyway, I guess it will behave the same as the Bitnami one based on your investigation above.

@twmb
Owner

twmb commented Oct 18, 2022

For context (creating Kafka issue now), this is my docker compose:

version: "3.7"
services:
  kafka:
    image: bitnami/kafka:latest
    network_mode: host
    environment:
      KAFKA_ENABLE_KRAFT: yes
      KAFKA_CFG_PROCESS_ROLES: controller,broker
      KAFKA_CFG_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_CFG_LISTENERS: PLAINTEXT://:9092,CONTROLLER://:9093
      KAFKA_CFG_LISTENER_SECURITY_PROTOCOL_MAP: CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT
      KAFKA_CFG_CONTROLLER_QUORUM_VOTERS: 1@127.0.0.1:9093
      # Set this to "PLAINTEXT://127.0.0.1:9092" if you want to run this container on localhost via Docker
      KAFKA_CFG_ADVERTISED_LISTENERS: PLAINTEXT://127.0.0.1:9092
      KAFKA_CFG_BROKER_ID: 1
      ALLOW_PLAINTEXT_LISTENER: yes
      KAFKA_KRAFT_CLUSTER_ID: XkpGZQ27R3eTl3OdTm2LYA # 16 byte base64-encoded UUID
      BITNAMI_DEBUG: true # Enable this to get more info on startup failures

How I am running this:

export KGO_TEST_RF=1
export KGO_TEST_RECORDS=50000
export KGO_LOG_LEVEL=debug

and then

docker compose down; sleep 1; docker compose up -d; sleep 5; while go test -run '(Txn|Group)' > logs ; do echo "whoo"; docker compose down; sleep 1 ; docker compose up -d ; sleep 5; done

@twmb
Owner

twmb commented Oct 18, 2022

https://issues.apache.org/jira/browse/KAFKA-14312
https://issues.apache.org/jira/browse/KAFKA-14313

AFAICT, there is no solution to 14312.

One approach I could try is to sleep 1s after creating topics in the test.
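Roughly what that would look like (a sketch only, using the kadm admin client; the helper name and the flat one-second wait are assumptions, not code that exists in the repo):

```go
package kgo_test

import (
	"context"
	"testing"
	"time"

	"github.com/twmb/franz-go/pkg/kadm"
	"github.com/twmb/franz-go/pkg/kgo"
)

// createTopicAndWait creates the topic and then sleeps briefly so a freshly
// elected kraft leader has settled before the test starts producing.
// (Hypothetical helper; 1s is a guess, not a verified bound.)
func createTopicAndWait(t *testing.T, cl *kgo.Client, topic string, partitions int32) {
	t.Helper()
	adm := kadm.NewClient(cl)
	if _, err := adm.CreateTopics(context.Background(), partitions, 1, nil, topic); err != nil {
		t.Fatalf("creating topic %q: %v", topic, err)
	}
	time.Sleep(time.Second)
}
```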

@twmb
Owner

twmb commented Oct 18, 2022

Another odd issue,

__transaction_state partition 33 offset 7 at [2022-10-18 11:15:37.821]
TxnMetadataKey(0) 9f87dc04dc3f4d5b15ef3072c531cf46327278307df8e149fa966462cd40c10b
TxnMetadataValue(0)
      ProducerID           41
      ProducerEpoch        0
      TimeoutMillis        120000
      State                PrepareCommit
      Topics               __consumer_offsets=>[13] e7c7d971626fbaf4bfb33975e57089167939e6acabb4c4fc534eb148462e45cc=>[4 5 12 16]  
      LastUpdateTimestamp  1666113337821
      StartTimestamp       1666113335311


__transaction_state partition 33 offset 8 at [2022-10-18 11:16:11.419]
TxnMetadataKey(0) 9f87dc04dc3f4d5b15ef3072c531cf46327278307df8e149fa966462cd40c10b
TxnMetadataValue(0)
      ProducerID           41
      ProducerEpoch        0
      TimeoutMillis        120000
      State                CompleteCommit
      Topics     
      LastUpdateTimestamp  1666113337821
      StartTimestamp       1666113335311

Kafka took 34 seconds to finish a commit within itself. This is causing duplicates, because a rebalance happened in the middle and moved partitions.

Technically, I should be using RequireStableFetchOffsets. I'm not sure why Kafka took 34 seconds to finish a transaction within itself, though -- the client goes to a lot of lengths to avoid the need for RequireStableFetchOffsets. I'll add the option and see how things go ...
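For reference, it is just another client option (sketch; the broker, group, and topic names are placeholders):

```go
package main

import "github.com/twmb/franz-go/pkg/kgo"

func main() {
	cl, err := kgo.NewClient(
		kgo.SeedBrokers("127.0.0.1:9092"),
		kgo.ConsumerGroup("my-group"), // placeholder group
		kgo.ConsumeTopics("my-topic"), // placeholder topic
		// Require that fetched committed offsets are stable, i.e. not part
		// of a still-in-flight transaction (KIP-447).
		kgo.RequireStableFetchOffsets(),
	)
	if err != nil {
		panic(err)
	}
	defer cl.Close()
}
```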

Opened https://issues.apache.org/jira/browse/KAFKA-14315 to track this

twmb added a commit that referenced this pull request Oct 18, 2022
twmb added a commit that referenced this pull request Oct 18, 2022
Single-broker localhost kraft testing shows this error being returned
**much** more frequently. Rather than having a 5s timeout, we now
continue until the input context is canceled. This largely affects
ending a transaction.

We also improve the locking situation in one area. It's a bit ugly, but
the solution is better than unlocking directly at every return site.

For #223.
twmb mentioned this pull request Oct 18, 2022
twmb added a commit that referenced this pull request Oct 18, 2022
twmb added a commit that referenced this pull request Oct 18, 2022
@twmb
Owner

twmb commented Oct 18, 2022

I've added a bunch of commits in #224 that should address failures, and cherry picked this commit. I've looped tests for a few hours and have not seen any failures locally. Closing this and merging the other PR.

This also finally makes it worth enabling branch protection (and requiring all commits to go through PRs), which is also nice because I've occasionally introduced a compile failure without testing (and immediately fixed it).

Now that this is merged, I plan to address 214, 220, 221, 222 and then release 1.9. Hopefully the rest of this is quick.

twmb closed this Oct 18, 2022
@mihaitodor
Contributor Author

That's really awesome, thank you for all this work @twmb!

@twmb
Owner

twmb commented Oct 20, 2022

For full posterity in case I need it later,

#!/bin/bash

docker compose down
sleep 1
docker compose up -d

until rpk topic list; do
	sleep 1
done

while go test -run Txn > logs; do
	echo "whoo"
	docker compose down
	sleep 1
	docker compose up -d
	until rpk topic list; do
		sleep 1
	done
done

Also if you want to help with #225, I need to change it to expose a port ;)
Turns out redpanda nightly does work, so long as I disable unsafe behavior in tests.

mihaitodor added a commit to mihaitodor/connect that referenced this pull request Oct 21, 2022
Some bug fixes [went in](https://github.com/twmb/franz-go/blob/master/CHANGELOG.md#v190)
following the PR I raised here: twmb/franz-go#223.

Since Travis was kind enough to also fix the old ARM builds, I
thought it should be fine to re-enable them by reverting 414cf1f.