HPA Race Condition with Astarte CRD Deployment #397

Open
guicrocetti opened this issue Dec 23, 2024 · 0 comments

Assignees
Labels
bug Something isn't working discussion This issue needs to be investigated/discussed (it might be already fixed, invalid or duplicated)

Comments

@guicrocetti
Contributor

Issue Description

When a Horizontal Pod Autoscaler (HPA) is deployed alongside Astarte Custom Resources, a race condition can occur: the HPA cannot fetch pod metrics in time during initialization and, as a result, sets the pod replica count to 0, leaving the deployment in an undesirable state.

Scenarios Affected

  1. When HPA is deployed before Astarte CRDs
  2. When HPA is deployed immediately after Astarte CRDs

Current Behavior

  • HPA sets pod replicas to 0 when it cannot fetch pod metrics initially
  • Creates a race condition between:
    • HPA trying to maintain 0 replicas
    • Operator attempting to increase replica count
  • Results in service disruption and unstable pod counts

Technical Details

  • Occurs even though the HPA minimum replica count is set to 1
  • HPA status shows both DesiredReplicas and CurrentReplicas as 0
  • Metrics state remains unknown during initialization period

Suggested Solution

Implement a protection mechanism in the operator that would:

  1. Detect problematic HPA state:
     // Both counters at 0 means the HPA never obtained usable metrics.
     if hpaStatus.DesiredReplicas == 0 && hpaStatus.CurrentReplicas == 0 {
         // Handle edge case
     }
  2. Take corrective action:
  • Delete the problematic HPA
  • Force a minimum replica count of 1
  • Log an alert for operators to investigate
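
The two steps above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the operator's actual code: `hpaStatus` is a local stand-in for the HPA status (only the two fields named in this issue), and `correctiveAction`, `isStuckAtZero`, and `planCorrection` are hypothetical names. The extra `minReplicas >= 1` guard is also an assumption, added so that an HPA that is legitimately allowed to sit at 0 would not be touched.

```go
package main

import "fmt"

// hpaStatus is a stand-in for the HPA status, reduced to the two fields
// mentioned in this issue. It is NOT the Kubernetes API type.
type hpaStatus struct {
	DesiredReplicas int32
	CurrentReplicas int32
}

// correctiveAction is a hypothetical description of what the operator
// would do once the problematic state is detected.
type correctiveAction struct {
	DeleteHPA   bool   // delete the problematic HPA
	MinReplicas int32  // replica count to force on the workload
	LogMessage  string // alert for operators to investigate
}

// isStuckAtZero reports whether the HPA is in the state described in the
// issue: both desired and current replicas are 0 although the configured
// minimum is at least 1 (the minimum check is an assumption of this sketch).
func isStuckAtZero(minReplicas int32, s hpaStatus) bool {
	return minReplicas >= 1 && s.DesiredReplicas == 0 && s.CurrentReplicas == 0
}

// planCorrection returns the corrective action and true when intervention
// is needed, or a zero value and false otherwise.
func planCorrection(name string, minReplicas int32, s hpaStatus) (correctiveAction, bool) {
	if !isStuckAtZero(minReplicas, s) {
		return correctiveAction{}, false
	}
	return correctiveAction{
		DeleteHPA:   true,
		MinReplicas: 1,
		LogMessage: fmt.Sprintf(
			"HPA %q reports 0 desired and 0 current replicas; deleting it and forcing 1 replica",
			name,
		),
	}, true
}

func main() {
	// "example-hpa" is a made-up name used only for this demonstration.
	action, needed := planCorrection("example-hpa", 1, hpaStatus{})
	if needed {
		fmt.Println(action.LogMessage)
	}
}
```

Keeping the decision (`planCorrection`) separate from the side effects (deleting the HPA, patching replicas, logging) would make the check easy to unit test without a cluster; how the action is applied inside the reconcile loop is left open here.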

Expected Benefits

  • Prevent service disruption
  • Maintain minimum availability
  • Provide clear logging for troubleshooting
@guicrocetti guicrocetti added bug Something isn't working discussion This issue needs to be investigated/discussed (it might be already fixed, invalid or duplicated) labels Dec 23, 2024
@guicrocetti guicrocetti self-assigned this Dec 23, 2024
guicrocetti added a commit to guicrocetti/astarte-kubernetes-operator that referenced this issue Dec 23, 2024
guicrocetti added a commit to guicrocetti/astarte-kubernetes-operator that referenced this issue Dec 23, 2024
guicrocetti added a commit to guicrocetti/astarte-kubernetes-operator that referenced this issue Dec 23, 2024
guicrocetti added a commit to guicrocetti/astarte-kubernetes-operator that referenced this issue Dec 23, 2024
guicrocetti added a commit to guicrocetti/astarte-kubernetes-operator that referenced this issue Dec 23, 2024