
Fixed inter-operation between follower fetching and incremental fetch requests #11748

Merged: 7 commits into redpanda-data:dev on Jun 29, 2023

Conversation

mmaslankaprv
Member

@mmaslankaprv mmaslankaprv commented Jun 28, 2023

If a partition contains information about the preferred replica, it MUST be
included in the incremental fetch response. Otherwise a client would
indefinitely retry asking the leader about the preferred replica and
stop making progress.

Fixes: #11767

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.1.x
  • v22.3.x
  • v22.2.x

Release Notes

Bug Fixes

  • fixed the consumer being unable to clear its backlog when using incremental fetch requests together with follower fetching
  • fixed the replica selector returning a read replica that may be unavailable
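The inclusion rule described above can be sketched as a small model. This is a hypothetical, simplified Python rendition of the broker-side fetch-session bookkeeping (the real code is C++ in Redpanda's `update_fetch_partition`); the class and field names here are illustrative assumptions, not Redpanda's actual identifiers.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PartitionState:
    """Per-partition metadata tracked in an incremental fetch session."""
    high_watermark: int = -1
    last_stable_offset: int = -1
    start_offset: int = -1
    preferred_read_replica: Optional[int] = None

def update_fetch_partition(cached: PartitionState, resp: PartitionState) -> bool:
    """Return True if the partition must be included in the incremental
    fetch response, i.e. any tracked metadata changed since the last send."""
    include = False
    if cached.high_watermark != resp.high_watermark:
        include = True
        cached.high_watermark = resp.high_watermark
    if cached.last_stable_offset != resp.last_stable_offset:
        include = True
        cached.last_stable_offset = resp.last_stable_offset
    # Part of this PR: a start offset change must also force inclusion,
    # so followers learn about log prefix truncation.
    if cached.start_offset != resp.start_offset:
        include = True
        cached.start_offset = resp.start_offset
    # The headline fix: a preferred read replica MUST always force
    # inclusion, otherwise the client retries against the leader forever.
    if resp.preferred_read_replica is not None:
        include = True
        cached.preferred_read_replica = resp.preferred_read_replica
    return include
```

Under this model, a partition whose metadata is unchanged is omitted from the incremental response, which is exactly why forgetting a field (here, the preferred replica or the start offset) silently starves the client of updates.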

@mmaslankaprv mmaslankaprv requested a review from dlex June 28, 2023 10:59
@mmaslankaprv mmaslankaprv changed the title Ff incremental fix Fixed inter-operation between follower fetching and incremental fetch requests Jun 28, 2023
@dlex dlex (Contributor) left a comment
looks good! a nit and a question, feel free to ignore.

tests/rptest/services/kafka_cli_consumer.py

def no_lag():
    gr = rpk.group_describe(consumer_group)
    if gr.state != "Stable":
Contributor

Trying to understand how the consumer group state (Stable or not) can tell us anything about follower lag. Is it the follower lag that is checked here, or some other lag?

Member Author

If a group is in any state other than Stable, no partitions are returned, so we cannot reason about the lag.
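The truncated helper above can be fleshed out as a self-contained sketch. The `GroupDescription` and `PartitionOffsets` stand-ins below model what the test's rpk wrapper would return; their names and shapes are assumptions for illustration, not the framework's actual types.

```python
from dataclasses import dataclass, field

@dataclass
class PartitionOffsets:
    """Per-partition consumer offsets as reported by group describe."""
    lag: int

@dataclass
class GroupDescription:
    """Consumer group state plus per-partition offset information."""
    state: str
    partitions: list = field(default_factory=list)

def no_lag(gr: GroupDescription) -> bool:
    # Partitions are only reported for a Stable group, so in any other
    # state we cannot reason about the lag at all.
    if gr.state != "Stable":
        return False
    # The backlog is cleared once every partition reports zero lag.
    return all(p.lag == 0 for p in gr.partitions)
```

In the actual test this predicate would be polled (e.g. with the framework's wait-until loop) until the consumer catches up.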

@bharathv bharathv (Contributor) left a comment

patch lgtm, but I have a couple of questions on client behavior just to be sure I'm understanding the issue correctly.

@@ -951,6 +951,10 @@ bool update_fetch_partition(
        include = true;
        partition.last_stable_offset = model::offset(resp.last_stable_offset);
    }
    if (partition.start_offset != resp.log_start_offset) {
Contributor

nice find..

q: what are the repercussions of this bug? If we miss this and the subsequent request tries to fetch from an evicted offset, I see the reader throws a runtime error. Is there more to it? Just making sure I understood it right.

Member Author

Without this, a consumer would not be able to see the start offset updates. I am not sure if there are any major issues beyond that, but I wanted to be consistent.

Comment on lines +225 to +232
# sleep long enough to cause metadata refresh on the consumer
time.sleep(30)
Contributor

q: I'm curious how this test is forcing the client to do 'incremental fetches' vs the other test that looks pretty similar.

Member Author

This test uses the Java Kafka client, which by default uses incremental fetch requests.

When replying to an incremental fetch request, the broker must include a
partition in the fetch response if there is a change in its metadata. In
Redpanda we were missing validation of start offset changes. This could
lead to situations in which a follower wasn't notified about changes in
the start offset.

Signed-off-by: Michal Maslanka <[email protected]>
…lica

If a partition contains information about the preferred replica, it MUST be
included in the incremental fetch response. Otherwise a client would
indefinitely retry asking the leader about the preferred replica and
stop making progress.

Signed-off-by: Michal Maslanka <[email protected]>
Added a test validating that follower fetching works correctly with
incremental fetch requests.

Signed-off-by: Michal Maslanka <[email protected]>
When a replica is marked as offline, it should not be considered a
candidate to read from.

Fixes: redpanda-data#11767

Signed-off-by: Michal Maslanka <[email protected]>
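The selection rule in the commit message above can be sketched as follows. This is a hypothetical, simplified Python model of rack-aware read-replica selection; Redpanda's actual replica selector is C++, and all names here are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Replica:
    node_id: int
    rack: Optional[str]
    online: bool

def select_read_replica(replicas: List[Replica],
                        consumer_rack: Optional[str],
                        leader_id: int) -> int:
    # The fix: offline replicas are filtered out before rack matching,
    # so they can never be returned as read candidates.
    candidates = [r for r in replicas if r.online]
    for r in candidates:
        if r.rack == consumer_rack:
            return r.node_id
    # No eligible replica in the consumer's rack: fall back to the
    # leader so the consumer can keep making progress.
    return leader_id
```

Without the online filter, a consumer whose rack's only replica went offline would be steered toward an unavailable node instead of falling back to the leader, which is the scenario the follow-up test exercises.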
Added a test checking that a consumer can continue operating if the only
eligible replica in the requested rack is offline.

Signed-off-by: Michal Maslanka <[email protected]>
@@ -951,6 +951,10 @@ bool update_fetch_partition(
        include = true;
        partition.last_stable_offset = model::offset(resp.last_stable_offset);
    }
    if (partition.start_offset != resp.log_start_offset) {
        include = true;
Member

is include = true intended to be conditionally set? otherwise, it seems like the entire condition is unnecessary and the start_offset can be set unconditionally. or perhaps the condition of them not being equal is interesting enough to log about?

Member

oh maybe it's including it only if it changed?

Member Author

exactly, include only if changed

@mmaslankaprv mmaslankaprv merged commit 745b12b into redpanda-data:dev Jun 29, 2023
@mmaslankaprv mmaslankaprv deleted the ff-incremental-fix branch June 29, 2023 15:53

Successfully merging this pull request may close these issues.

FF: check returned replica health
4 participants