
Fix integration test action #223

Closed


mihaitodor
Contributor

Hey @twmb, following up on #218, I think this should do the trick using the bitnami/kafka container in Zookeeperless mode. I had to adjust the tests a bit so the KGO_SEEDS env var is read by each test which needs to connect to Kafka.
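Roughly, each test now builds its client from that env var (a simplified sketch, not the exact code in this PR; the helper name and the localhost fallback are illustrative):

```go
package kgo_test

import (
	"os"
	"strings"
	"testing"

	"github.com/twmb/franz-go/pkg/kgo"
)

// testSeeds reads KGO_SEEDS and falls back to localhost when it is unset.
// (Hypothetical helper name; the real tests may structure this differently.)
func testSeeds() []string {
	seeds := os.Getenv("KGO_SEEDS")
	if seeds == "" {
		seeds = "127.0.0.1:9092"
	}
	return strings.Split(seeds, ",")
}

func TestClientConnects(t *testing.T) {
	cl, err := kgo.NewClient(kgo.SeedBrokers(testSeeds()...))
	if err != nil {
		t.Fatal(err)
	}
	defer cl.Close()
}
```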

I also tried to use Redpanda, but there are some issues with it. Could you please let me know how you got the tests to pass against it? It may just be that I'm not configuring it correctly (maybe it doesn't like the fact that I'm running it as a single node?), but my Kafka knowledge is too limited to troubleshoot it. Here's what I tried:

> docker run --rm -p9092:9092 docker.vectorized.io/vectorized/redpanda:latest start --default-log-level=debug --smp 1 --reserve-memory 0M --overprovisioned --set redpanda.enable_transactions=true --kafka-addr 0.0.0.0:9092 --advertise-kafka-addr 127.0.0.1:9092
> # Wait for container to start
> KGO_TEST_RF=1 KGO_TEST_RECORDS=50000 go test -v -count 1 ./...

I pulled the current latest tag of the Redpanda container (sha256:84c8cd98ed7a50e4f378af55600c92c4fb097afab4ec164c3ded9b0e884754c4).

I guess this log entry might be relevant: [ERROR] fatal InitProducerID error, failing all buffered records; broker: 1, err: INVALID_PRODUCER_EPOCH: Producer attempted an operation with an old epoch.
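For context, a client configured for transactions issues InitProducerID up front, which is presumably where this surfaces. A minimal sketch of such a client (placeholder broker, topic, and transactional ID; not the actual test code):

```go
package main

import (
	"context"

	"github.com/twmb/franz-go/pkg/kgo"
)

func main() {
	// A transactional producer sends InitProducerID when the transaction
	// machinery starts, so broker-side transaction problems show up early.
	cl, err := kgo.NewClient(
		kgo.SeedBrokers("127.0.0.1:9092"),
		kgo.TransactionalID("example-txn-id"), // placeholder id
	)
	if err != nil {
		panic(err)
	}
	defer cl.Close()

	ctx := context.Background()
	if err := cl.BeginTransaction(); err != nil {
		panic(err)
	}
	rec := &kgo.Record{Topic: "example-topic", Value: []byte("hello")} // placeholder topic
	cl.Produce(ctx, rec, nil)
	if err := cl.Flush(ctx); err != nil {
		panic(err)
	}
	if err := cl.EndTransaction(ctx, kgo.TryCommit); err != nil {
		panic(err)
	}
}
```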

Here are the full logs:

redpanda.log
test.log

@ssorren

ssorren commented Oct 16, 2022

As far as I know, Redpanda does not yet support transactions. I encountered the same error myself and read their docs.

@mihaitodor
Contributor Author

I thought it does if you use --set redpanda.enable_transactions=true, but maybe the implementation isn't complete yet.

@ssorren

ssorren commented Oct 16, 2022

I thought it was the community edition that didn't support it, but I may be mistaken.

@ssorren

ssorren commented Oct 16, 2022

Ahhh: "Compacted topic doesn't support transactional processing" from the Redpanda logs. That's what's causing my issue. Not sure about the integration tests.

from their docs:

Is Redpanda transaction support complete?

No, it isn't. Redpanda transactions don't cover compacted topics, also we don't support KIP-447 and KIP-360 yet. Overall transactional support is on the Kafka 2.4 level. But we work on catching up.

@mihaitodor
Contributor Author

Oh, I see. Thank you for digging that up! ❤️ Dunno if the failures I'm seeing are related (maybe @twmb knows), but, hopefully, they'll catch up soon :)

@twmb
Owner

twmb commented Oct 16, 2022

I thought I tested against Redpanda above, but I had a stray Kafka broker running, which allowed the tests to pass. Redpanda's transactions are a bit WIP right now and should be fixed in the next release. I'll test against it, but this repo should stick to Kafka for now.

@mihaitodor
Contributor Author

mihaitodor commented Oct 17, 2022

OK, makes sense to stick with Kafka for now and maybe add Redpanda as an extra test in the future.

Not sure what to make of the failure... It worked great for me: https://github.com/mihaitodor/franz-go/actions/runs/3257547284/jobs/5348876552. Hope it's not a heisenbug.

LE: I just ran it again on another branch in my fork and it passed: https://github.com/mihaitodor/franz-go/actions/runs/3261604515/jobs/5356889078

LE2: Definitely a heisenbug 😕 I ran it again here. Guess this log entry is relevant:

2022-10-17T01:13:42.3034436Z  [2022-10-17 01:13:26,007] ERROR [ReplicaManager broker=1] Error processing append operation on partition 48ed629636ee415e86260963f079acbdd4d0e6f29f8d20e230715d8a5933f99e-19 (kafka.server.ReplicaManager)
2022-10-17T01:13:42.3035125Z  org.apache.kafka.common.errors.OutOfOrderSequenceException: Out of order sequence number for producer 46 at offset 808 in partition 48ed629636ee415e86260963f079acbdd4d0e6f29f8d20e230715d8a5933f99e-19: 0 (incoming seq. number), 2516 (current end sequence number)

I wonder if it might be due to Kafka taking too long to start... I didn't add any sort of health check on the container, since it looks like the tests try to reconnect a bunch of times before giving up. I'm surprised it didn't reproduce on localhost. I might try it again a few more times to see.
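If a health check does turn out to be necessary, one option would be a small readiness helper that pings the broker before the suite starts (just a sketch, not something this PR adds; the helper name is made up):

```go
package kgo_test

import (
	"context"
	"time"

	"github.com/twmb/franz-go/pkg/kgo"
)

// waitForKafka blocks until the broker answers a ping or the timeout expires.
// (Hypothetical helper; Ping issues a lightweight request to check reachability.)
func waitForKafka(seeds []string, timeout time.Duration) error {
	cl, err := kgo.NewClient(kgo.SeedBrokers(seeds...))
	if err != nil {
		return err
	}
	defer cl.Close()

	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	for {
		if err := cl.Ping(ctx); err == nil {
			return nil // broker is reachable
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Second):
		}
	}
}
```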

@twmb
Owner

twmb commented Oct 17, 2022

I'll see if I can recreate this locally -- a sequence number reset can happen in a few scenarios, but they shouldn't be happening in this test.

@mihaitodor
Contributor Author

Since Kafka is running in Zookeeperless mode, I wonder if it might have some rough edges. Hopefully not.

@twmb
Owner

twmb commented Oct 17, 2022

It looks like Kafka isn't respecting its 5-wide idempotency window. I'm checking to see if something in Bitnami's image specifically changed it, or what. From reproducing locally:

  • send partition 13 sequence # start 0
  • send partition 13 sequence # start 701
  • failure, retry the above two
  • receive OOOSN because Kafka is not de-duplicating the repeated messages

@twmb
Owner

twmb commented Oct 17, 2022

I think this is bringing up an issue that has always been possible in Kafka, but is now visible with whatever changes they're making in Kraft. Apparently, the first produce request's sequence number does not need to start at 0. That's problematic:

  • client creates topic foo, kafka replies "sure, created"
  • client sends produce request A with 5 messages, sends produce request B with 5 messages before a response for A is received (base sequence numbers 0 and 5)
  • request A is rejected with NOT_LEADER_FOR_PARTITION because the topic is not fully initialized, request B is accepted
  • Kafka now has 5 records locally, starting at a base sequence of 5 and it is expecting 10
  • client re-sends request A and B
  • request A is rejected as out of order, since it has base sequence 0

I'll check the Kafka source to see if it truly does accept a non-zero sequence number for the first request.

It seems that Kraft keeps replying NOT_LEADER_FOR_PARTITION for a bit after a topic is created, and quite a bit more frequently.

@twmb
Owner

twmb commented Oct 17, 2022

@mihaitodor
Contributor Author

Very interesting, thanks for sharing that insight!

Out of curiosity, I also tried to see if I could use Confluent's Kafka container in Zookeeperless mode, since they offer a setup example here: https://github.com/confluentinc/cp-all-in-one/blob/7.2.1-post/cp-all-in-one-kraft/docker-compose.yml, but I guess I'm doing something wrong with the way I'm volume mounting that silly shell script: https://github.com/mihaitodor/franz-go/actions/runs/3269542233/jobs/5377158865. GitHub has some bugs around volume mounting in service containers, and getting around them is very frustrating... Anyway, I guess it will behave the same as the Bitnami one based on your investigation above.

@twmb
Owner

twmb commented Oct 18, 2022

For context (creating Kafka issue now), this is my docker compose:

version: "3.7"
services:
  kafka:
    image: bitnami/kafka:latest
    network_mode: host
    environment:
      KAFKA_ENABLE_KRAFT: yes
      KAFKA_CFG_PROCESS_ROLES: controller,broker
      KAFKA_CFG_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_CFG_LISTENERS: PLAINTEXT://:9092,CONTROLLER://:9093
      KAFKA_CFG_LISTENER_SECURITY_PROTOCOL_MAP: CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT
      KAFKA_CFG_CONTROLLER_QUORUM_VOTERS: 1@127.0.0.1:9093
      # Set this to "PLAINTEXT://127.0.0.1:9092" if you want to run this container on localhost via Docker
      KAFKA_CFG_ADVERTISED_LISTENERS: PLAINTEXT://127.0.0.1:9092
      KAFKA_CFG_BROKER_ID: 1
      ALLOW_PLAINTEXT_LISTENER: yes
      KAFKA_KRAFT_CLUSTER_ID: XkpGZQ27R3eTl3OdTm2LYA # 16 byte base64-encoded UUID
      BITNAMI_DEBUG: true # Enable this to get more info on startup failures

How I am running this:

export KGO_TEST_RF=1
export KGO_TEST_RECORDS=50000
export KGO_LOG_LEVEL=debug

and then

docker compose down; sleep 1; docker compose up -d; sleep 5; while go test -run '(Txn|Group)' > logs ; do echo "whoo"; docker compose down; sleep 1 ; docker compose up -d ; sleep 5; done

@twmb
Owner

twmb commented Oct 18, 2022

https://issues.apache.org/jira/browse/KAFKA-14312
https://issues.apache.org/jira/browse/KAFKA-14313

AFAICT, there is no solution to 14312.

One approach I could try is to sleep 1s after creating topics in the test.
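Roughly what that would look like (a sketch only, using the kadm admin client; the helper name and the flat one-second wait are assumptions, not code that exists in the repo):

```go
package kgo_test

import (
	"context"
	"testing"
	"time"

	"github.com/twmb/franz-go/pkg/kadm"
	"github.com/twmb/franz-go/pkg/kgo"
)

// createTopicAndWait creates the topic and then sleeps briefly so a freshly
// elected kraft leader has settled before the test starts producing.
// (Hypothetical helper; 1s is a guess, not a verified bound.)
func createTopicAndWait(t *testing.T, cl *kgo.Client, topic string, partitions int32) {
	t.Helper()
	adm := kadm.NewClient(cl)
	if _, err := adm.CreateTopics(context.Background(), partitions, 1, nil, topic); err != nil {
		t.Fatalf("creating topic %q: %v", topic, err)
	}
	time.Sleep(time.Second)
}
```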

@twmb
Owner

twmb commented Oct 18, 2022

Another odd issue,

__transaction_state partition 33 offset 7 at [2022-10-18 11:15:37.821]
TxnMetadataKey(0) 9f87dc04dc3f4d5b15ef3072c531cf46327278307df8e149fa966462cd40c10b
TxnMetadataValue(0)
      ProducerID           41
      ProducerEpoch        0
      TimeoutMillis        120000
      State                PrepareCommit
      Topics               __consumer_offsets=>[13] e7c7d971626fbaf4bfb33975e57089167939e6acabb4c4fc534eb148462e45cc=>[4 5 12 16]  
      LastUpdateTimestamp  1666113337821
      StartTimestamp       1666113335311


__transaction_state partition 33 offset 8 at [2022-10-18 11:16:11.419]
TxnMetadataKey(0) 9f87dc04dc3f4d5b15ef3072c531cf46327278307df8e149fa966462cd40c10b
TxnMetadataValue(0)
      ProducerID           41
      ProducerEpoch        0
      TimeoutMillis        120000
      State                CompleteCommit
      Topics     
      LastUpdateTimestamp  1666113337821
      StartTimestamp       1666113335311

Kafka took 34 seconds to finish a commit within itself. This is causing duplicates, because a rebalance happened in the middle and moved partitions.

Technically, I should be using RequireStableFetchOffsets. I'm not sure why Kafka took 34 seconds to finish a transaction within itself, though -- the client goes to a lot of lengths to avoid the need for RequireStableFetchOffsets. I'll add the option and see how things go ...
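For reference, it is just another client option (sketch; the broker, group, and topic names are placeholders):

```go
package main

import "github.com/twmb/franz-go/pkg/kgo"

func main() {
	cl, err := kgo.NewClient(
		kgo.SeedBrokers("127.0.0.1:9092"),
		kgo.ConsumerGroup("my-group"), // placeholder group
		kgo.ConsumeTopics("my-topic"), // placeholder topic
		// Require that fetched committed offsets are stable, i.e. not part
		// of a still-in-flight transaction (KIP-447).
		kgo.RequireStableFetchOffsets(),
	)
	if err != nil {
		panic(err)
	}
	defer cl.Close()
}
```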

Opened https://issues.apache.org/jira/browse/KAFKA-14315 to track this

twmb added a commit that referenced this pull request Oct 18, 2022
twmb added a commit that referenced this pull request Oct 18, 2022
Single-broker localhost kraft testing shows this error being returned
**much** more frequently. Rather than having a 5s timeout, we now
continue until the input context is canceled. This largely affects
ending a transaction.

We also improve the locking situation in one area. It's a bit ugly, but
the solution is better than unlocking directly at every return site.

For #223.
twmb mentioned this pull request Oct 18, 2022
twmb added a commit that referenced this pull request Oct 18, 2022
twmb added a commit that referenced this pull request Oct 18, 2022
@twmb
Owner

twmb commented Oct 18, 2022

I've added a bunch of commits in #224 that should address failures, and cherry picked this commit. I've looped tests for a few hours and have not seen any failures locally. Closing this and merging the other PR.

This also finally makes it worth enabling branch protection (and requiring all commits to go through PRs), which is also nice because I've occasionally introduced a compile failure without testing (and immediately fixed it).

Now that this is merged, I plan to address 214, 220, 221, 222 and then release 1.9. Hopefully the rest of this is quick.

twmb closed this Oct 18, 2022
@mihaitodor
Contributor Author

That's really awesome, thank you for all this work @twmb!

@twmb
Owner

twmb commented Oct 20, 2022

For full posterity in case I need it later,

#!/bin/bash

docker compose down
sleep 1
docker compose up -d

until rpk topic list; do
	sleep 1
done

while go test -run Txn > logs; do
	echo "whoo"
	docker compose down
	sleep 1
	docker compose up -d
	until rpk topic list; do
		sleep 1
	done
done

Also if you want to help with #225, I need to change it to expose a port ;)
Turns out redpanda nightly does work, so long as I disable unsafe behavior in tests.

mihaitodor added a commit to mihaitodor/connect that referenced this pull request Oct 21, 2022
Some bug fixes [went in](https://github.com/twmb/franz-go/blob/master/CHANGELOG.md#v190)
following the PR I raised here: twmb/franz-go#223.

Since Travis was kind enough to also fix the old ARM builds, I
thought it should be fine to re-enable them by reverting 414cf1f.