Upgrade hang in group_metadata_migration when consumer group topic doesn't already exist #4469

Closed
jcsp opened this issue Apr 28, 2022 · 0 comments · Fixed by #4474
Assignees
Labels
area/controller kind/bug Something isn't working
Milestone

Comments

@jcsp
Contributor

jcsp commented Apr 28, 2022

Found by @VadimPlh with the new ducktape upgrade test.

  • 3 nodes on 21.11.x
  • upgrade 1 node to 22.1.x
  • The upgraded node isn't controller leader
  • It enters group_metadata_migration::start and hits the "kafka_internal/group topic does not exists, activating" path
  • This call waits for activate_feature
  • activate_feature loops until the feature is active, but it cannot be activated because only the controller leader runs the feature_manager logic for activating features, and the controller leader is a 21.11.x node that doesn't have the code.
  • The node remains in 'booting' state indefinitely

I think the fix is probably simple: spawn a background fiber for the activate_feature call, so that its loop doesn't block the startup of redpanda.
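
The proposed fix can be illustrated with a toy model. This is a minimal sketch, not Redpanda's actual code: it uses only the C++ standard library rather than Seastar, and every name in it (`toy_reactor`, `toy_cluster`, `toy_node`, `start_migration`, `boot`) is hypothetical. The wait for the feature is registered as a background task that is polled on each scheduler tick, so startup completes even while the controller leader cannot activate the feature.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Toy stand-in for a cooperative scheduler (hypothetical, not Seastar).
// Each task is polled once per tick and dropped once it returns true.
class toy_reactor {
public:
    void spawn(std::function<bool()> task) { _tasks.push_back(std::move(task)); }

    void tick() {
        // Move tasks out first so a task may safely spawn new tasks.
        std::vector<std::function<bool()>> running = std::move(_tasks);
        _tasks.clear();
        for (auto& t : running) {
            if (!t()) {
                _tasks.push_back(std::move(t)); // not done, poll again later
            }
        }
    }

    std::size_t pending() const { return _tasks.size(); }

private:
    std::vector<std::function<bool()>> _tasks;
};

// Hypothetical cluster state as seen from the upgraded node.
struct toy_cluster {
    bool leader_upgraded = false; // does the controller leader run the new code?
    bool feature_active = false;

    // Models the leader-only activation logic: nothing happens while the
    // leader is still on the old version.
    void leader_step() {
        if (leader_upgraded) {
            feature_active = true;
        }
    }
};

struct toy_node {
    bool booted = false;
    bool migration_done = false;
};

// Sketch of the proposed start(): register the wait and return immediately
// instead of looping inline until the feature is active.
void start_migration(toy_reactor& r, toy_cluster& c, toy_node& n) {
    r.spawn([&c, &n] {
        if (!c.feature_active) {
            return false; // still waiting, retry on the next tick
        }
        n.migration_done = true;
        return true;
    });
}

void boot(toy_reactor& r, toy_cluster& c, toy_node& n) {
    start_migration(r, c, n); // no longer blocks startup
    n.booted = true;
}
```

In this model, calling `boot` on a cluster whose leader has not been upgraded leaves the node booted with one pending task. Once `leader_upgraded` is set and `leader_step()` plus `tick()` run, the task completes and `migration_done` becomes true. A real implementation would also need to stop the background work cleanly on shutdown, which this sketch does not model.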

jcsp added the kind/bug and area/controller labels Apr 28, 2022
jcsp added this to the v22.1.1 milestone Apr 28, 2022
VadimPlh added a commit to VadimPlh/redpanda that referenced this issue Apr 28, 2022
After upgrading a node in the cluster from 21.11.x to 22.1.x, if the
upgraded node isn't the controller leader, it enters group_metadata_migration::start
and hits the "kafka_internal/group topic does not exists, activating" path.
This call waits for activate_feature. activate_feature loops until the feature is active,
but it cannot be activated because only the controller leader runs the feature_manager
logic for activating features, and the controller leader is a 21.11.x node
that doesn't have the code. The node remains in 'booting' state indefinitely.

Fixes: redpanda-data#4469
VadimPlh self-assigned this Apr 28, 2022
VadimPlh added a commit to VadimPlh/redpanda that referenced this issue Apr 29, 2022
VadimPlh added a commit to VadimPlh/redpanda that referenced this issue Apr 29, 2022
VadimPlh added a commit to VadimPlh/redpanda that referenced this issue May 1, 2022
VadimPlh added a commit to VadimPlh/redpanda that referenced this issue May 1, 2022
VadimPlh added a commit to VadimPlh/redpanda that referenced this issue May 4, 2022
VadimPlh added a commit to VadimPlh/redpanda that referenced this issue May 4, 2022
VadimPlh added a commit to VadimPlh/redpanda that referenced this issue May 4, 2022
VadimPlh added a commit to VadimPlh/redpanda that referenced this issue May 4, 2022
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue May 4, 2022
(cherry picked from commit b7fb5bd)
2 participants