
Fixed inter-operation between follower fetching and incremental fetch requests #11748

Merged: 7 commits into redpanda-data:dev on Jun 29, 2023

Conversation

mmaslankaprv
Member

@mmaslankaprv mmaslankaprv commented Jun 28, 2023

If a partition contains information about the preferred replica, it MUST be
included in the incremental fetch response. Otherwise a client would
indefinitely retry asking the leader about the preferred replica and
stop making progress.

Fixes: #11767

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.1.x
  • v22.3.x
  • v22.2.x

Release Notes

Bug Fixes

  • fixed the consumer being unable to clear its backlog when using incremental fetch requests together with follower fetching
  • fixed the replica selector returning a read replica that may be unavailable
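The inclusion rule described above can be sketched as a small model. This is a hypothetical, simplified Python rendition of the broker-side fetch-session bookkeeping (the real code is C++ in Redpanda's `update_fetch_partition`); the class and field names here are illustrative assumptions, not Redpanda's actual identifiers.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PartitionState:
    """Per-partition metadata tracked in an incremental fetch session."""
    high_watermark: int = -1
    last_stable_offset: int = -1
    start_offset: int = -1
    preferred_read_replica: Optional[int] = None

def update_fetch_partition(cached: PartitionState, resp: PartitionState) -> bool:
    """Return True if the partition must be included in the incremental
    fetch response, i.e. any tracked metadata changed since the last send."""
    include = False
    if cached.high_watermark != resp.high_watermark:
        include = True
        cached.high_watermark = resp.high_watermark
    if cached.last_stable_offset != resp.last_stable_offset:
        include = True
        cached.last_stable_offset = resp.last_stable_offset
    # Part of this PR: a start offset change must also force inclusion,
    # so followers learn about log prefix truncation.
    if cached.start_offset != resp.start_offset:
        include = True
        cached.start_offset = resp.start_offset
    # The headline fix: a preferred read replica MUST always force
    # inclusion, otherwise the client retries against the leader forever.
    if resp.preferred_read_replica is not None:
        include = True
        cached.preferred_read_replica = resp.preferred_read_replica
    return include
```

Under this model, a partition whose metadata is unchanged is omitted from the incremental response, which is exactly why forgetting a field (here, the preferred replica or the start offset) silently starves the client of updates.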

@mmaslankaprv mmaslankaprv requested a review from dlex June 28, 2023 10:59
@mmaslankaprv mmaslankaprv changed the title Ff incremental fix Fixed inter-operation between follower fetching and incremental fetch requests Jun 28, 2023
@dlex dlex (Contributor) left a comment
looks good! a nit and a question, feel free to ignore.

tests/rptest/services/kafka_cli_consumer.py

def no_lag():
    gr = rpk.group_describe(consumer_group)
    if gr.state != "Stable":
Contributor

Trying to understand how the consumer group state (Stable or not) can tell us anything about follower lag. Is it the follower lag that is checked here, or some other lag?

Member Author

If a group is in any state other than Stable, no partitions are returned, so we cannot reason about the lag.
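The truncated helper above can be fleshed out as a self-contained sketch. The `GroupDescription` and `PartitionOffsets` stand-ins below model what the test's rpk wrapper would return; their names and shapes are assumptions for illustration, not the framework's actual types.

```python
from dataclasses import dataclass, field

@dataclass
class PartitionOffsets:
    """Per-partition consumer offsets as reported by group describe."""
    lag: int

@dataclass
class GroupDescription:
    """Consumer group state plus per-partition offset information."""
    state: str
    partitions: list = field(default_factory=list)

def no_lag(gr: GroupDescription) -> bool:
    # Partitions are only reported for a Stable group, so in any other
    # state we cannot reason about the lag at all.
    if gr.state != "Stable":
        return False
    # The backlog is cleared once every partition reports zero lag.
    return all(p.lag == 0 for p in gr.partitions)
```

In the actual test this predicate would be polled (e.g. with the framework's wait-until loop) until the consumer catches up.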

@bharathv bharathv (Contributor) left a comment

patch lgtm, but I have a couple of questions on client behavior just to be sure I'm understanding the issue correctly.

@@ -951,6 +951,10 @@ bool update_fetch_partition(
        include = true;
        partition.last_stable_offset = model::offset(resp.last_stable_offset);
    }
    if (partition.start_offset != resp.log_start_offset) {
Contributor

nice find..

q: what are the repercussions of this bug? If we miss this and the subsequent request tries to fetch from an evicted offset, I see the reader throws a runtime error. Is there more to it? Just making sure I understood it right.

Member Author

Without this, a consumer would not be able to see the start offset updates. I am not sure if there are any major issues beyond that, but I wanted to be consistent.

Comment on lines +225 to +232
# sleep long enough to cause metadata refresh on the consumer
time.sleep(30)
Contributor

q: I'm curious how this test is forcing the client to do 'incremental fetches' vs the other test that looks pretty similar.

Member Author

This test uses the Java Kafka client, which by default uses incremental fetch requests.

When replying to an incremental fetch request, the broker must include a
partition in the fetch response if there is a change in its metadata. In
Redpanda we were missing validation of start offset changes. This could
lead to situations in which a follower wasn't notified about changes in
the start offset.

Signed-off-by: Michal Maslanka <[email protected]>
…lica

If a partition contains information about the preferred replica, it MUST be
included in the incremental fetch response. Otherwise a client would
indefinitely retry asking the leader about the preferred replica and
stop making progress.

Signed-off-by: Michal Maslanka <[email protected]>
Added a test validating that follower fetching works correctly with
incremental fetch requests.

Signed-off-by: Michal Maslanka <[email protected]>
When a replica is marked as offline, it should not be considered a
candidate to read from.

Fixes: redpanda-data#11767

Signed-off-by: Michal Maslanka <[email protected]>
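The selection rule in the commit message above can be sketched as follows. This is a hypothetical, simplified Python model of rack-aware read-replica selection; Redpanda's actual replica selector is C++, and all names here are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Replica:
    node_id: int
    rack: Optional[str]
    online: bool

def select_read_replica(replicas: List[Replica],
                        consumer_rack: Optional[str],
                        leader_id: int) -> int:
    # The fix: offline replicas are filtered out before rack matching,
    # so they can never be returned as read candidates.
    candidates = [r for r in replicas if r.online]
    for r in candidates:
        if r.rack == consumer_rack:
            return r.node_id
    # No eligible replica in the consumer's rack: fall back to the
    # leader so the consumer can keep making progress.
    return leader_id
```

Without the online filter, a consumer whose rack's only replica went offline would be steered toward an unavailable node instead of falling back to the leader, which is the scenario the follow-up test exercises.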
Added a test checking that a consumer can continue operating if the only
eligible replica in the requested rack is offline.

Signed-off-by: Michal Maslanka <[email protected]>
@@ -951,6 +951,10 @@ bool update_fetch_partition(
        include = true;
        partition.last_stable_offset = model::offset(resp.last_stable_offset);
    }
    if (partition.start_offset != resp.log_start_offset) {
        include = true;
Member

is include = true intended to be conditionally set? otherwise, it seems like the entire condition is unnecessary and the start_offset can be set unconditionally. or perhaps the condition of them not being equal is interesting enough to log about?

Member

oh maybe it's including it only if it changed?

Member Author

exactly, include only if changed

@mmaslankaprv mmaslankaprv merged commit 745b12b into redpanda-data:dev Jun 29, 2023
@mmaslankaprv mmaslankaprv deleted the ff-incremental-fix branch June 29, 2023 15:53

Successfully merging this pull request may close these issues.

FF: check returned replica health
4 participants