Rollout superseding an in progress rollout gets stuck #3331

Closed
2 tasks done
meeech opened this issue Jan 26, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@meeech
Contributor

meeech commented Jan 26, 2024

Checklist:

  • I've included steps to reproduce the bug.
  • I've included the version of argo rollouts.

Describe the bug

A Rollout is currently live (Rev 1).
Start a new Rollout (Rev 2).
While Rev 2 is in progress, start another Rollout (Rev 3).
Rev 3 gets 'stuck': it doesn't progress.
Rev 2 is also stuck: it doesn't get spun down.

To Reproduce

(assuming you already have a stable rollout in cluster - Rev 1)
Start a rollout. (Rev 2)
While that rollout is progressing, start another rollout. (Rev 3)

Expected behavior

Rev 2 should be cancelled and spun down. Then Rev 3 should start spinning up.

Screenshots

image

Version

Tested with 1.6.4 and 1.6.5

This is a regression: I tested with 1.5.1 before upgrading to 1.6.4/1.6.5, and with 1.5.1 this bug didn't happen.

The rollout is a basic canary rollout. No traffic routing.

Workaround/How to get out of this bad state

If you find yourself in this situation, you can get unstuck as follows:

Abort the rollout. This will put you back into the Stable configuration. Then hit Retry. This will start the Rollout.

Going with my repro steps above, Rev 3 would then proceed as normal.
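
For anyone scripting the abort-then-retry workaround rather than clicking through the dashboard, here is a minimal sketch. It assumes the `kubectl argo rollouts` plugin is installed and that its `abort` and `retry rollout` subcommands are the CLI equivalents of the Abort and Retry buttons. The helper names `recovery_commands` and `recover` are hypothetical, and the rollout name in the usage line is a placeholder.

```python
import subprocess


def recovery_commands(rollout, namespace="default"):
    """Build the two CLI invocations for the workaround: abort, then retry."""
    base = ["kubectl", "argo", "rollouts"]
    return [
        base + ["abort", rollout, "-n", namespace],
        base + ["retry", "rollout", rollout, "-n", namespace],
    ]


def recover(rollout, namespace="default", dry_run=True):
    """Print the commands (dry run) or execute them in order.

    With dry_run=False, the first failing command raises
    subprocess.CalledProcessError and the second is not attempted.
    """
    for cmd in recovery_commands(rollout, namespace):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder rollout name; only prints the commands.
    recover("my-rollout", dry_run=True)
```

The dry-run default is deliberate: aborting a rollout changes live traffic, so it is worth reading the commands before running them.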

Logs

# Paste the logs from the rollout controller

# Logs for the entire controller:

time="2024-01-26T20:23:30Z" level=info msg="attempting to acquire leader lease argo-rollouts/argo-rollouts-controller-lock...\n"
time="2024-01-26T20:23:47Z" level=info msg="successfully acquired lease argo-rollouts/argo-rollouts-controller-lock\n"
time="2024-01-26T20:23:47Z" level=warning msg="Controller is running."
time="2024-01-26T20:25:46Z" level=error msg="rollout syncHandler error: Operation cannot be fulfilled on rollouts.argoproj.io \"scratch-mitchell-amihod-old-timey-service\": the object has been modified; please apply your changes to the latest version and try again" namespace=default rollout=scratch-mitchell-amihod-old-timey-service
time="2024-01-26T20:25:46Z" level=error msg="Operation cannot be fulfilled on rollouts.argoproj.io \"scratch-mitchell-amihod-old-timey-service\": the object has been modified; please apply your changes to the latest version and try again\n" error="<nil>"
time="2024-01-26T20:25:46Z" level=error msg="Failed to run trigger, trigger: on-rollout-updated, destination: {slack davey-jones-locker}, namespace config:  : trigger 'on-rollout-updated' is not configured"
time="2024-01-26T20:25:46Z" level=error msg="Notifications failed to send for eventReason RolloutUpdated with error: [trigger 'on-rollout-updated' is not configured]" event_reason=RolloutUpdated namespace=default rollout=scratch-mitchell-amihod-old-timey-service
time="2024-01-26T20:25:47Z" level=error msg="roCtx.reconcile err failed to scaleReplicaSetAndRecordEvent in reconcileNewReplicaSet: failed to scaleReplicaSet in scaleReplicaSetAndRecordEvent: error updating replicaset scratch-mitchell-amihod-old-timey-service-85455dcd8d: Operation cannot be fulfilled on replicasets.apps \"scratch-mitchell-amihod-old-timey-service-85455dcd8d\": the object has been modified; please apply your changes to the latest version and try again" generation=2 namespace=default resourceVersion=17476332 rollout=scratch-mitchell-amihod-old-timey-service
time="2024-01-26T20:25:47Z" level=error msg="rollout syncHandler error: failed to scaleReplicaSetAndRecordEvent in reconcileNewReplicaSet: failed to scaleReplicaSet in scaleReplicaSetAndRecordEvent: error updating replicaset scratch-mitchell-amihod-old-timey-service-85455dcd8d: Operation cannot be fulfilled on replicasets.apps \"scratch-mitchell-amihod-old-timey-service-85455dcd8d\": the object has been modified; please apply your changes to the latest version and try again" namespace=default rollout=scratch-mitchell-amihod-old-timey-service
time="2024-01-26T20:25:47Z" level=error msg="failed to scaleReplicaSetAndRecordEvent in reconcileNewReplicaSet: failed to scaleReplicaSet in scaleReplicaSetAndRecordEvent: error updating replicaset scratch-mitchell-amihod-old-timey-service-85455dcd8d: Operation cannot be fulfilled on replicasets.apps \"scratch-mitchell-amihod-old-timey-service-85455dcd8d\": the object has been modified; please apply your changes to the latest version and try again\n" error="<nil>"
time="2024-01-26T20:25:47Z" level=error msg="Failed to run trigger, trigger: on-scaling-replica-set, destination: {slack davey-jones-locker}, namespace config:  : trigger 'on-scaling-replica-set' is not configured"
time="2024-01-26T20:25:47Z" level=error msg="Notifications failed to send for eventReason ScalingReplicaSet with error: [trigger 'on-scaling-replica-set' is not configured]" event_reason=ScalingReplicaSet namespace=default rollout=scratch-mitchell-amihod-old-timey-service
time="2024-01-26T20:25:47Z" level=error msg="Failed to run trigger, trigger: on-rollout-completed, destination: {slack davey-jones-locker}, namespace config:  : trigger 'on-rollout-completed' is not configured"
time="2024-01-26T20:25:47Z" level=error msg="Notifications failed to send for eventReason RolloutCompleted with error: [trigger 'on-rollout-completed' is not configured]" event_reason=RolloutCompleted namespace=default rollout=scratch-mitchell-amihod-old-timey-service
time="2024-01-26T20:26:01Z" level=error msg="Failed to run trigger, trigger: on-rollout-updated, destination: {slack davey-jones-locker}, namespace config:  : trigger 'on-rollout-updated' is not configured"
time="2024-01-26T20:26:01Z" level=error msg="Notifications failed to send for eventReason RolloutUpdated with error: [trigger 'on-rollout-updated' is not configured]" event_reason=RolloutUpdated namespace=default rollout=scratch-mitchell-amihod-old-timey-service
time="2024-01-26T20:26:02Z" level=error msg="roCtx.reconcile err failed to getAllReplicaSetsAndSyncRevision in rolloutCanary create true: Operation cannot be fulfilled on rollouts.argoproj.io \"scratch-mitchell-amihod-old-timey-service\": the object has been modified; please apply your changes to the latest version and try again" generation=2 namespace=default resourceVersion=17476396 rollout=scratch-mitchell-amihod-old-timey-service
time="2024-01-26T20:26:02Z" level=error msg="rollout syncHandler error: failed to getAllReplicaSetsAndSyncRevision in rolloutCanary create true: Operation cannot be fulfilled on rollouts.argoproj.io \"scratch-mitchell-amihod-old-timey-service\": the object has been modified; please apply your changes to the latest version and try again" namespace=default rollout=scratch-mitchell-amihod-old-timey-service
time="2024-01-26T20:26:02Z" level=error msg="failed to getAllReplicaSetsAndSyncRevision in rolloutCanary create true: Operation cannot be fulfilled on rollouts.argoproj.io \"scratch-mitchell-amihod-old-timey-service\": the object has been modified; please apply your changes to the latest version and try again\n" error="<nil>"
time="2024-01-26T20:26:02Z" level=error msg="Failed to run trigger, trigger: on-scaling-replica-set, destination: {slack davey-jones-locker}, namespace config:  : trigger 'on-scaling-replica-set' is not configured"
time="2024-01-26T20:26:02Z" level=error msg="Notifications failed to send for eventReason ScalingReplicaSet with error: [trigger 'on-scaling-replica-set' is not configured]" event_reason=ScalingReplicaSet namespace=default rollout=scratch-mitchell-amihod-old-timey-service
time="2024-01-26T20:26:03Z" level=error msg="Failed to run trigger, trigger: on-rollout-step-completed, destination: {slack davey-jones-locker}, namespace config:  : trigger 'on-rollout-step-completed' is not configured"
time="2024-01-26T20:26:03Z" level=error msg="Notifications failed to send for eventReason RolloutStepCompleted with error: [trigger 'on-rollout-step-completed' is not configured]" event_reason=RolloutStepCompleted namespace=default rollout=scratch-mitchell-amihod-old-timey-service
time="2024-01-26T20:26:03Z" level=error msg="Failed to run trigger, trigger: on-rollout-paused, destination: {slack davey-jones-locker}, namespace config:  : trigger 'on-rollout-paused' is not configured"
time="2024-01-26T20:26:03Z" level=error msg="Notifications failed to send for eventReason RolloutPaused with error: [trigger 'on-rollout-paused' is not configured]" event_reason=RolloutPaused namespace=default rollout=scratch-mitchell-amihod-old-timey-service
time="2024-01-26T20:26:11Z" level=error msg="Error: updating rollout revision" error="Operation cannot be fulfilled on rollouts.argoproj.io \"scratch-mitchell-amihod-old-timey-service\": the object has been modified; please apply your changes to the latest version and try again" namespace=default rollout=scratch-mitchell-amihod-old-timey-service
time="2024-01-26T20:26:11Z" level=error msg="roCtx.reconcile err failed to getAllReplicaSetsAndSyncRevision in rolloutCanary create true: Operation cannot be fulfilled on rollouts.argoproj.io \"scratch-mitchell-amihod-old-timey-service\": the object has been modified; please apply your changes to the latest version and try again" generation=2 namespace=default resourceVersion=17476455 rollout=scratch-mitchell-amihod-old-timey-service
time="2024-01-26T20:26:11Z" level=error msg="rollout syncHandler error: failed to getAllReplicaSetsAndSyncRevision in rolloutCanary create true: Operation cannot be fulfilled on rollouts.argoproj.io \"scratch-mitchell-amihod-old-timey-service\": the object has been modified; please apply your changes to the latest version and try again" namespace=default rollout=scratch-mitchell-amihod-old-timey-service
time="2024-01-26T20:26:11Z" level=error msg="failed to getAllReplicaSetsAndSyncRevision in rolloutCanary create true: Operation cannot be fulfilled on rollouts.argoproj.io \"scratch-mitchell-amihod-old-timey-service\": the object has been modified; please apply your changes to the latest version and try again\n" error="<nil>"
time="2024-01-26T20:26:11Z" level=error msg="Failed to run trigger, trigger: on-rollout-updated, destination: {slack davey-jones-locker}, namespace config:  : trigger 'on-rollout-updated' is not configured"
time="2024-01-26T20:26:11Z" level=error msg="Notifications failed to send for eventReason RolloutUpdated with error: [trigger 'on-rollout-updated' is not configured]" event_reason=RolloutUpdated namespace=default rollout=scratch-mitchell-amihod-old-timey-service
time="2024-01-26T20:26:11Z" level=error msg="roCtx.reconcile err error updating replicaset in syncEphemeralMetadata: Operation cannot be fulfilled on replicasets.apps \"scratch-mitchell-amihod-old-timey-service-c7bc5b98\": the object has been modified; please apply your changes to the latest version and try again" generation=2 namespace=default resourceVersion=17476459 rollout=scratch-mitchell-amihod-old-timey-service
time="2024-01-26T20:26:11Z" level=error msg="rollout syncHandler error: error updating replicaset in syncEphemeralMetadata: Operation cannot be fulfilled on replicasets.apps \"scratch-mitchell-amihod-old-timey-service-c7bc5b98\": the object has been modified; please apply your changes to the latest version and try again" namespace=default rollout=scratch-mitchell-amihod-old-timey-service
time="2024-01-26T20:26:11Z" level=error msg="error updating replicaset in syncEphemeralMetadata: Operation cannot be fulfilled on replicasets.apps \"scratch-mitchell-amihod-old-timey-service-c7bc5b98\": the object has been modified; please apply your changes to the latest version and try again\n" error="<nil>"
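
The controller log above mixes two kinds of errors: notification failures ("trigger ... is not configured") and update conflicts ("the object has been modified"). A small triage sketch like the following can separate them when reading a long paste. It assumes the logfmt-style `key=value` layout seen in these lines; the function names and the three category labels are made up for illustration.

```python
import re
from collections import Counter

# Matches key=value where value is either a double-quoted string
# (allowing backslash escapes such as \") or a bare token.
FIELD = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S*)')


def parse_line(line):
    """Parse one log line into a dict of field name -> value."""
    fields = {}
    for key, value in FIELD.findall(line):
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1].replace('\\"', '"')
        fields[key] = value
    return fields


def classify(fields):
    """Label a parsed line as 'notification', 'conflict', or 'other'."""
    text = fields.get("msg", "") + " " + fields.get("error", "")
    if "is not configured" in text:
        return "notification"
    if "the object has been modified" in text:
        return "conflict"
    return "other"


def summarize(lines, rollout=None):
    """Count categories, optionally keeping only lines for one rollout."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue
        fields = parse_line(line)
        if rollout is not None and fields.get("rollout") != rollout:
            continue
        counts[classify(fields)] += 1
    return counts
```

Passing `rollout=` filters on the `rollout` field, so lines that do not carry that field are dropped from the count.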

# Logs for a specific rollout:
time="2024-01-26T20:25:46Z" level=error msg="rollout syncHandler error: Operation cannot be fulfilled on rollouts.argoproj.io \"scratch-mitchell-amihod-old-timey-service\": the object has been modified; please apply your changes to the latest version and try again" namespace=default rollout=scratch-mitchell-amihod-old-timey-service
time="2024-01-26T20:25:46Z" level=error msg="Notifications failed to send for eventReason RolloutUpdated with error: [trigger 'on-rollout-updated' is not configured]" event_reason=RolloutUpdated namespace=default rollout=scratch-mitchell-amihod-old-timey-service
time="2024-01-26T20:25:47Z" level=error msg="roCtx.reconcile err failed to scaleReplicaSetAndRecordEvent in reconcileNewReplicaSet: failed to scaleReplicaSet in scaleReplicaSetAndRecordEvent: error updating replicaset scratch-mitchell-amihod-old-timey-service-85455dcd8d: Operation cannot be fulfilled on replicasets.apps \"scratch-mitchell-amihod-old-timey-service-85455dcd8d\": the object has been modified; please apply your changes to the latest version and try again" generation=2 namespace=default resourceVersion=17476332 rollout=scratch-mitchell-amihod-old-timey-service
time="2024-01-26T20:25:47Z" level=error msg="rollout syncHandler error: failed to scaleReplicaSetAndRecordEvent in reconcileNewReplicaSet: failed to scaleReplicaSet in scaleReplicaSetAndRecordEvent: error updating replicaset scratch-mitchell-amihod-old-timey-service-85455dcd8d: Operation cannot be fulfilled on replicasets.apps \"scratch-mitchell-amihod-old-timey-service-85455dcd8d\": the object has been modified; please apply your changes to the latest version and try again" namespace=default rollout=scratch-mitchell-amihod-old-timey-service
time="2024-01-26T20:25:47Z" level=error msg="Notifications failed to send for eventReason ScalingReplicaSet with error: [trigger 'on-scaling-replica-set' is not configured]" event_reason=ScalingReplicaSet namespace=default rollout=scratch-mitchell-amihod-old-timey-service
time="2024-01-26T20:25:47Z" level=error msg="Notifications failed to send for eventReason RolloutCompleted with error: [trigger 'on-rollout-completed' is not configured]" event_reason=RolloutCompleted namespace=default rollout=scratch-mitchell-amihod-old-timey-service
time="2024-01-26T20:26:01Z" level=error msg="Notifications failed to send for eventReason RolloutUpdated with error: [trigger 'on-rollout-updated' is not configured]" event_reason=RolloutUpdated namespace=default rollout=scratch-mitchell-amihod-old-timey-service
time="2024-01-26T20:26:02Z" level=error msg="roCtx.reconcile err failed to getAllReplicaSetsAndSyncRevision in rolloutCanary create true: Operation cannot be fulfilled on rollouts.argoproj.io \"scratch-mitchell-amihod-old-timey-service\": the object has been modified; please apply your changes to the latest version and try again" generation=2 namespace=default resourceVersion=17476396 rollout=scratch-mitchell-amihod-old-timey-service
time="2024-01-26T20:26:02Z" level=error msg="rollout syncHandler error: failed to getAllReplicaSetsAndSyncRevision in rolloutCanary create true: Operation cannot be fulfilled on rollouts.argoproj.io \"scratch-mitchell-amihod-old-timey-service\": the object has been modified; please apply your changes to the latest version and try again" namespace=default rollout=scratch-mitchell-amihod-old-timey-service
time="2024-01-26T20:26:02Z" level=error msg="Notifications failed to send for eventReason ScalingReplicaSet with error: [trigger 'on-scaling-replica-set' is not configured]" event_reason=ScalingReplicaSet namespace=default rollout=scratch-mitchell-amihod-old-timey-service
time="2024-01-26T20:26:03Z" level=error msg="Notifications failed to send for eventReason RolloutStepCompleted with error: [trigger 'on-rollout-step-completed' is not configured]" event_reason=RolloutStepCompleted namespace=default rollout=scratch-mitchell-amihod-old-timey-service
time="2024-01-26T20:26:03Z" level=error msg="Notifications failed to send for eventReason RolloutPaused with error: [trigger 'on-rollout-paused' is not configured]" event_reason=RolloutPaused namespace=default rollout=scratch-mitchell-amihod-old-timey-service
time="2024-01-26T20:26:11Z" level=error msg="Error: updating rollout revision" error="Operation cannot be fulfilled on rollouts.argoproj.io \"scratch-mitchell-amihod-old-timey-service\": the object has been modified; please apply your changes to the latest version and try again" namespace=default rollout=scratch-mitchell-amihod-old-timey-service
time="2024-01-26T20:26:11Z" level=error msg="roCtx.reconcile err failed to getAllReplicaSetsAndSyncRevision in rolloutCanary create true: Operation cannot be fulfilled on rollouts.argoproj.io \"scratch-mitchell-amihod-old-timey-service\": the object has been modified; please apply your changes to the latest version and try again" generation=2 namespace=default resourceVersion=17476455 rollout=scratch-mitchell-amihod-old-timey-service
time="2024-01-26T20:26:11Z" level=error msg="rollout syncHandler error: failed to getAllReplicaSetsAndSyncRevision in rolloutCanary create true: Operation cannot be fulfilled on rollouts.argoproj.io \"scratch-mitchell-amihod-old-timey-service\": the object has been modified; please apply your changes to the latest version and try again" namespace=default rollout=scratch-mitchell-amihod-old-timey-service
time="2024-01-26T20:26:11Z" level=error msg="Notifications failed to send for eventReason RolloutUpdated with error: [trigger 'on-rollout-updated' is not configured]" event_reason=RolloutUpdated namespace=default rollout=scratch-mitchell-amihod-old-timey-service
time="2024-01-26T20:26:11Z" level=error msg="roCtx.reconcile err error updating replicaset in syncEphemeralMetadata: Operation cannot be fulfilled on replicasets.apps \"scratch-mitchell-amihod-old-timey-service-c7bc5b98\": the object has been modified; please apply your changes to the latest version and try again" generation=2 namespace=default resourceVersion=17476459 rollout=scratch-mitchell-amihod-old-timey-service
time="2024-01-26T20:26:11Z" level=error msg="rollout syncHandler error: error updating replicaset in syncEphemeralMetadata: Operation cannot be fulfilled on replicasets.apps \"scratch-mitchell-amihod-old-timey-service-c7bc5b98\": the object has been modified; please apply your changes to the latest version and try again" namespace=default rollout=scratch-mitchell-amihod-old-timey-service

So overall at this point, the rollout is stuck.
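
For readers unfamiliar with the "the object has been modified" errors in the logs: they are Kubernetes optimistic-concurrency conflicts, meaning a write was submitted against a resourceVersion that was no longer current. The sketch below simulates that mechanism in plain Python. `Store`, `Conflict`, and `update_with_retry` are hypothetical names, not part of Argo Rollouts or any Kubernetes client. Such conflicts are normally resolved by re-reading and retrying, so their presence alone does not explain why the rollout stays stuck.

```python
class Conflict(Exception):
    """Raised when a write carries a stale version number."""


class Store:
    """A tiny in-memory stand-in for an API server holding one object."""

    def __init__(self, obj):
        self._obj = dict(obj)
        self._version = 1

    def get(self):
        # Return a copy plus the version it was read at.
        return dict(self._obj), self._version

    def update(self, obj, version):
        # Reject the write if someone else updated the object since the read.
        if version != self._version:
            raise Conflict("the object has been modified")
        self._obj = dict(obj)
        self._version += 1
        return self._version


def update_with_retry(store, mutate, attempts=5):
    """Read, apply mutate(obj) in place, write; re-read and retry on conflict."""
    for _ in range(attempts):
        obj, version = store.get()
        mutate(obj)
        try:
            return store.update(obj, version)
        except Conflict:
            continue
    raise Conflict("gave up after %d attempts" % attempts)
```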

I'd be interested in pairing with someone to fix this.


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

@meeech added the bug label Jan 26, 2024
@meeech closed this as not planned Jan 26, 2024
@meeech
Contributor Author

meeech commented Jan 26, 2024

This was a total accident. I was working on a report and hit enter. I was goofing around with the title.

@meeech changed the title from "The is a bug! Please Help!" to "Rollout superseding an in progress rollout gets stuck" Jan 26, 2024
@meeech reopened this Jan 26, 2024
@meeech
Contributor Author

meeech commented Jan 27, 2024

More data:

After talking with @zachaller, we thought one possibility was the rollbackWindow setting (I was using it, but he was not, and he was unable to reproduce).

So I did some more testing. It gets weirder. :D

  • no rollback window
  • steps: 25%/pause/50%/pause/100%/pause
  • set to 1 replica (with & without hpa)
  • set to 2, then 3 replicas (with & without hpa)

Rev 1 - deploys ok
Rev 2 - deploys ok, steps start
Rev 3 - stuck.

  • set to 4 replicas (with & without rollback window)
    it works

  • set to 5 replicas (with and without rollback window)
    it works

  • set to 50% weight, 2 replicas
    works as expected

  • Issue was introduced in 1.6.1.
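
To make the replica counts in these tests easier to compare, here is a rough sketch of how a weight step could map to whole pods. It assumes each side's share is simply rounded up, which is an assumption for illustration only: the controller's real calculation may differ (for example, it also has to respect maxSurge and maxUnavailable). `basic_canary_split` is a hypothetical helper, not an Argo Rollouts function.

```python
def ceil_div(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def basic_canary_split(replicas, weight):
    """Return (canary, stable) pod counts for a weight given in percent.

    ASSUMPTION: both shares are rounded up independently.
    """
    canary = ceil_div(replicas * weight, 100)
    stable = ceil_div(replicas * (100 - weight), 100)
    return canary, stable


if __name__ == "__main__":
    # The combinations mentioned in the tests above.
    for replicas, weight in [(1, 25), (2, 25), (3, 25), (4, 25), (5, 25), (2, 50)]:
        canary, stable = basic_canary_split(replicas, weight)
        print(f"replicas={replicas} weight={weight}% -> canary={canary} stable={stable}")
```

Under this rounding rule, the 25% step at 1, 2, and 3 replicas leaves the stable side at the full replica count, while at 4 and 5 replicas (and at 50% with 2 replicas) it drops below it. That is only a pattern in the numbers; whether it relates to the stuck state is not established in this thread.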

@eugenepaniot

eugenepaniot commented Jan 29, 2024

Hi @meeech, it appears we have a similar issue, as discussed in #3316.
As a workaround, we have set --rollout-resync=60, and it seems to work fine (or at least helps to mitigate the issue).

@ashutosh16
Contributor

The change made in #3077 adds a deeper check to validate the service selection.
IMO it would be better to validate the behavior after reverting the commit 56586f9fc6c7b867749ad84717fa96e8486f9f96ac1f59a4a747f2083d648f24R348.
Also check the total number of replicas in the new ReplicaSet, as discussed here.

@meeech
Contributor Author

meeech commented Feb 22, 2024

Fixed by #3354

@meeech closed this as completed Feb 22, 2024
3 participants