Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sync semaphore: Lock not released during retry backoff #9875

Closed
2 of 3 tasks
Konkrad opened this issue Oct 21, 2022 · 1 comment · Fixed by #11210
Closed
2 of 3 tasks

Sync semaphore: Lock not released during retry backoff #9875

Konkrad opened this issue Oct 21, 2022 · 1 comment · Fixed by #11210

Comments

@Konkrad
Copy link

Konkrad commented Oct 21, 2022

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issues exists when I tested with :latest
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

When using the semaphore in combination with a retry strategy, I expect that while the retry backoff other tasks can be processed.

I use argo to process multiple workflows, each having multiple tasks. A semaphore is used to limit the spawning of pods in parallel. Furthermore, a retry policy is used with backoff as the pods rely on other APIs which might restrict access or are not available. The observation is that while a task is sitting out the backoff duration, the semaphore lock is not released and other tasks can't be executed in the meantime. In my opinion, this is not an expected behavior, as it undermines the use of semaphores to regulate the spawning of containers.

To my knowledge it worked before as expected, and I suspect that the following MR reintroduced the current unexpected behavior (I did not test on a previous version) #9063

image

Version

v3.4.1

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.

apiVersion: v1
kind: ConfigMap
metadata:
 name: argo-parallel
data:
  tasks: "1"
---
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: parallelism
spec:
  entrypoint: parallelism
  templates:
  - name: parallelism
    steps:
    - - name: retry-backoff
        template: retry-backoff
        withItems:
        - this
        - workflow
        - should
        - try
        - other
        - tasks
        - during
        - backoff

  - name: retry-backoff
    retryStrategy:
      limit: "10"
      backoff:
        duration: "6h"
        factor: "2"
        maxDuration: "24h"
    synchronization:
        semaphore:
          configMapKeyRef:
            name: argo-parallel
            key: tasks
    container:
      image: python:alpine3.6
      command: ["python", -c]
      # fail with a 66% probability
      args: ["import random; import sys; exit_code = random.choice([0, 1, 1]); sys.exit(exit_code)"]

Logs from the workflow controller

time="2022-10-21T09:23:51.231Z" level=info msg="Processing workflow" namespace=dev-acquisition workflow=parallelism5rpvs
time="2022-10-21T09:23:51.242Z" level=info msg="Updated phase  -> Running" namespace=dev-acquisition workflow=parallelism5rpvs
time="2022-10-21T09:23:51.242Z" level=info msg="Steps node parallelism5rpvs initialized Running" namespace=dev-acquisition workflow=parallelism5rpvs
time="2022-10-21T09:23:51.242Z" level=info msg="StepGroup node parallelism5rpvs-13883845 initialized Running" namespace=dev-acquisition workflow=parallelism5rpvs
time="2022-10-21T09:23:51.253Z" level=info msg="dev-acquisition/ConfigMap/argo-parallel/tasks acquired by dev-acquisition/parallelism5rpvs/parallelism5rpvs-374547325. Lock availability: 0/1" semaphore=dev-acquisition/ConfigMap/argo-parallel/tasks
time="2022-10-21T09:23:51.253Z" level=info msg="Node parallelism5rpvs[0].retry-backoff(0:this) acquired synchronization lock" namespace=dev-acquisition workflow=parallelism5rpvs
time="2022-10-21T09:23:51.253Z" level=info msg="Retry node parallelism5rpvs-374547325 initialized Running" namespace=dev-acquisition workflow=parallelism5rpvs
time="2022-10-21T09:23:51.253Z" level=info msg="Pod node parallelism5rpvs-4158946316 initialized Pending" namespace=dev-acquisition workflow=parallelism5rpvs
time="2022-10-21T09:23:51.289Z" level=info msg="Created pod: parallelism5rpvs[0].retry-backoff(0:this)(0) (parallelism5rpvs-retry-backoff-4158946316)" namespace=dev-acquisition workflow=parallelism5rpvs
time="2022-10-21T09:23:51.296Z" level=info msg="Retry node parallelism5rpvs-2831477905 initialized Pending" namespace=dev-acquisition workflow=parallelism5rpvs
time="2022-10-21T09:23:51.307Z" level=info msg="Retry node parallelism5rpvs-1458317794 initialized Pending" namespace=dev-acquisition workflow=parallelism5rpvs
time="2022-10-21T09:23:51.316Z" level=info msg="Retry node parallelism5rpvs-4122838961 initialized Pending" namespace=dev-acquisition workflow=parallelism5rpvs
time="2022-10-21T09:23:51.323Z" level=info msg="Retry node parallelism5rpvs-2090753915 initialized Pending" namespace=dev-acquisition workflow=parallelism5rpvs
time="2022-10-21T09:23:51.328Z" level=info msg="Retry node parallelism5rpvs-1391302162 initialized Pending" namespace=dev-acquisition workflow=parallelism5rpvs
time="2022-10-21T09:23:51.336Z" level=info msg="Retry node parallelism5rpvs-1404086644 initialized Pending" namespace=dev-acquisition workflow=parallelism5rpvs
time="2022-10-21T09:23:51.346Z" level=info msg="Retry node parallelism5rpvs-309128010 initialized Pending" namespace=dev-acquisition workflow=parallelism5rpvs
time="2022-10-21T09:23:51.346Z" level=info msg="Workflow step group node parallelism5rpvs-13883845 not yet completed" namespace=dev-acquisition workflow=parallelism5rpvs
time="2022-10-21T09:23:51.346Z" level=info msg="TaskSet Reconciliation" namespace=dev-acquisition workflow=parallelism5rpvs
time="2022-10-21T09:23:51.346Z" level=info msg=reconcileAgentPod namespace=dev-acquisition workflow=parallelism5rpvs
time="2022-10-21T09:23:51.360Z" level=info msg="Workflow update successful" namespace=dev-acquisition phase=Running resourceVersion=63494524 workflow=parallelism5rpvs
time="2022-10-21T09:24:01.291Z" level=info msg="Processing workflow" namespace=dev-acquisition workflow=parallelism5rpvs
time="2022-10-21T09:24:01.291Z" level=info msg="Task-result reconciliation" namespace=dev-acquisition numObjs=0 workflow=parallelism5rpvs
time="2022-10-21T09:24:01.291Z" level=info msg="node changed" namespace=dev-acquisition new.message= new.phase=Succeeded new.progress=0/1 nodeID=parallelism5rpvs-4158946316 old.message= old.phase=Pending old.progress=0/1 workflow=parallelism5rpvs
time="2022-10-21T09:24:01.301Z" level=info msg="Node parallelism5rpvs[0].retry-backoff(0:this) acquired synchronization lock" namespace=dev-acquisition workflow=parallelism5rpvs
time="2022-10-21T09:24:01.301Z" level=info msg="node parallelism5rpvs-374547325 phase Running -> Succeeded" namespace=dev-acquisition workflow=parallelism5rpvs
time="2022-10-21T09:24:01.301Z" level=info msg="node parallelism5rpvs-374547325 finished: 2022-10-21 09:24:01.301904041 +0000 UTC" namespace=dev-acquisition workflow=parallelism5rpvs
time="2022-10-21T09:24:01.301Z" level=info msg="Lock has been released by dev-acquisition/parallelism5rpvs/parallelism5rpvs-374547325. Available locks: 1" semaphore=dev-acquisition/ConfigMap/argo-parallel/tasks
time="2022-10-21T09:24:01.308Z" level=info msg="dev-acquisition/ConfigMap/argo-parallel/tasks acquired by dev-acquisition/parallelism5rpvs/parallelism5rpvs-2831477905. Lock availability: 0/1" semaphore=dev-acquisition/ConfigMap/argo-parallel/tasks
time="2022-10-21T09:24:01.308Z" level=info msg="Node parallelism5rpvs[0].retry-backoff(1:workflow) acquired synchronization lock" namespace=dev-acquisition workflow=parallelism5rpvs
time="2022-10-21T09:24:01.308Z" level=info msg="Pod node parallelism5rpvs-2784362216 initialized Pending" namespace=dev-acquisition workflow=parallelism5rpvs
time="2022-10-21T09:24:01.333Z" level=info msg="Created pod: parallelism5rpvs[0].retry-backoff(1:workflow)(0) (parallelism5rpvs-retry-backoff-2784362216)" namespace=dev-acquisition workflow=parallelism5rpvs
time="2022-10-21T09:24:01.386Z" level=info msg="Workflow step group node parallelism5rpvs-13883845 not yet completed" namespace=dev-acquisition workflow=parallelism5rpvs
time="2022-10-21T09:24:01.386Z" level=info msg="TaskSet Reconciliation" namespace=dev-acquisition workflow=parallelism5rpvs
time="2022-10-21T09:24:01.386Z" level=info msg=reconcileAgentPod namespace=dev-acquisition workflow=parallelism5rpvs
time="2022-10-21T09:24:01.415Z" level=info msg="Workflow update successful" namespace=dev-acquisition phase=Running resourceVersion=63494620 workflow=parallelism5rpvs
time="2022-10-21T09:24:01.421Z" level=info msg="cleaning up pod" action=labelPodCompleted key=dev-acquisition/parallelism5rpvs-retry-backoff-4158946316/labelPodCompleted
time="2022-10-21T09:24:02.302Z" level=info msg="Processing workflow" namespace=dev-acquisition workflow=parallelism5rpvs
time="2022-10-21T09:24:02.303Z" level=info msg="Task-result reconciliation" namespace=dev-acquisition numObjs=0 workflow=parallelism5rpvs
time="2022-10-21T09:24:02.303Z" level=info msg="node changed" namespace=dev-acquisition new.message=PodInitializing new.phase=Pending new.progress=0/1 nodeID=parallelism5rpvs-2784362216 old.message= old.phase=Pending old.progress=0/1 workflow=parallelism5rpvs
time="2022-10-21T09:24:02.309Z" level=info msg="Node parallelism5rpvs[0].retry-backoff(1:workflow) acquired synchronization lock" namespace=dev-acquisition workflow=parallelism5rpvs
time="2022-10-21T09:24:02.343Z" level=info msg="Workflow step group node parallelism5rpvs-13883845 not yet completed" namespace=dev-acquisition workflow=parallelism5rpvs
time="2022-10-21T09:24:02.343Z" level=info msg="TaskSet Reconciliation" namespace=dev-acquisition workflow=parallelism5rpvs
time="2022-10-21T09:24:02.343Z" level=info msg=reconcileAgentPod namespace=dev-acquisition workflow=parallelism5rpvs
time="2022-10-21T09:24:02.356Z" level=info msg="Workflow update successful" namespace=dev-acquisition phase=Running resourceVersion=63494642 workflow=parallelism5rpvs
time="2022-10-21T09:24:13.182Z" level=info msg="Processing workflow" namespace=dev-acquisition workflow=parallelism5rpvs
time="2022-10-21T09:24:13.182Z" level=info msg="Task-result reconciliation" namespace=dev-acquisition numObjs=0 workflow=parallelism5rpvs
time="2022-10-21T09:24:13.182Z" level=info msg="node changed" namespace=dev-acquisition new.message="Error (exit code 1)" new.phase=Failed new.progress=0/1 nodeID=parallelism5rpvs-2784362216 old.message=PodInitializing old.phase=Pending old.progress=0/1 workflow=parallelism5rpvs
time="2022-10-21T09:24:13.193Z" level=info msg="Node parallelism5rpvs[0].retry-backoff(1:workflow) acquired synchronization lock" namespace=dev-acquisition workflow=parallelism5rpvs
time="2022-10-21T09:24:13.193Z" level=info msg="node parallelism5rpvs-2831477905 message: Backoff for 6 hours 0 minutes" namespace=dev-acquisition workflow=parallelism5rpvs
time="2022-10-21T09:24:13.227Z" level=info msg="Workflow step group node parallelism5rpvs-13883845 not yet completed" namespace=dev-acquisition workflow=parallelism5rpvs
time="2022-10-21T09:24:13.227Z" level=info msg="TaskSet Reconciliation" namespace=dev-acquisition workflow=parallelism5rpvs
time="2022-10-21T09:24:13.227Z" level=info msg=reconcileAgentPod namespace=dev-acquisition workflow=parallelism5rpvs
time="2022-10-21T09:24:13.245Z" level=info msg="Workflow update successful" namespace=dev-acquisition phase=Running resourceVersion=63494727 workflow=parallelism5rpvs
time="2022-10-21T09:24:13.251Z" level=info msg="cleaning up pod" action=labelPodCompleted key=dev-acquisition/parallelism5rpvs-retry-backoff-2784362216/labelPodCompleted
time="2022-10-21T09:24:23.284Z" level=info msg="Processing workflow" namespace=dev-acquisition workflow=parallelism5rpvs
time="2022-10-21T09:24:23.284Z" level=info msg="Task-result reconciliation" namespace=dev-acquisition numObjs=0 workflow=parallelism5rpvs
time="2022-10-21T09:24:23.295Z" level=info msg="Node parallelism5rpvs[0].retry-backoff(1:workflow) acquired synchronization lock" namespace=dev-acquisition workflow=parallelism5rpvs
time="2022-10-21T09:24:23.295Z" level=info msg="node parallelism5rpvs-2831477905 message: Backoff for 6 hours 0 minutes" namespace=dev-acquisition workflow=parallelism5rpvs
time="2022-10-21T09:24:23.329Z" level=info msg="Workflow step group node parallelism5rpvs-13883845 not yet completed" namespace=dev-acquisition workflow=parallelism5rpvs
time="2022-10-21T09:24:23.329Z" level=info msg="TaskSet Reconciliation" namespace=dev-acquisition workflow=parallelism5rpvs
time="2022-10-21T09:24:23.329Z" level=info msg=reconcileAgentPod namespace=dev-acquisition workflow=parallelism5rpvs
time="2022-10-21T09:24:23.344Z" level=info msg="Workflow update successful" namespace=dev-acquisition phase=Running resourceVersion=63494727 workflow=parallelism5rpvs
time="2022-10-21T09:44:23.285Z" level=info msg="Processing workflow" namespace=dev-acquisition workflow=parallelism5rpvs
time="2022-10-21T09:44:23.285Z" level=info msg="Task-result reconciliation" namespace=dev-acquisition numObjs=0 workflow=parallelism5rpvs
time="2022-10-21T09:44:23.311Z" level=info msg="Node parallelism5rpvs[0].retry-backoff(1:workflow) acquired synchronization lock" namespace=dev-acquisition workflow=parallelism5rpvs
time="2022-10-21T09:44:23.311Z" level=info msg="node parallelism5rpvs-2831477905 message: Backoff for 6 hours 0 minutes" namespace=dev-acquisition workflow=parallelism5rpvs
time="2022-10-21T09:44:23.346Z" level=info msg="Workflow step group node parallelism5rpvs-13883845 not yet completed" namespace=dev-acquisition workflow=parallelism5rpvs
time="2022-10-21T09:44:23.347Z" level=info msg="TaskSet Reconciliation" namespace=dev-acquisition workflow=parallelism5rpvs
time="2022-10-21T09:44:23.347Z" level=info msg=reconcileAgentPod namespace=dev-acquisition workflow=parallelism5rpvs
time="2022-10-21T09:44:23.359Z" level=info msg="Workflow update successful" namespace=dev-acquisition phase=Running resourceVersion=63494727 workflow=parallelism5rpvs

Logs from in your workflow's wait container

time="2022-10-21T09:24:03.265Z" level=info msg="Starting Workflow Executor" version=v3.4.1
time="2022-10-21T09:24:03.267Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2022-10-21T09:24:03.267Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=dev-acquisition podName=parallelism5rpvs-retry-backoff-2784362216 template="{\"name\":\"retry-backoff\",\"inputs\":{},\"outputs\":{},\"metadata\":{},\"container\":{\"name\":\"\",\"image\":\"python:alpine3.6\",\"command\":[\"python\",\"-c\"],\"args\":[\"import random; import sys; exit_code = random.choice([0, 1, 1]); sys.exit(exit_code)\"],\"resources\":{}},\"retryStrategy\":{\"limit\":\"10\",\"backoff\":{\"duration\":\"6h\",\"factor\":\"2\",\"maxDuration\":\"24h\"}},\"synchronization\":{\"semaphore\":{\"configMapKeyRef\":{\"name\":\"argo-parallel\",\"key\":\"tasks\"}}}}" version="&Version{Version:v3.4.1,BuildDate:2022-10-01T15:03:42Z,GitCommit:0546fef0b096d84c9e3362d2b241614e743ebe97,GitTag:v3.4.1,GitTreeState:clean,GoVersion:go1.18.6,Compiler:gc,Platform:linux/amd64,}"
time="2022-10-21T09:24:03.267Z" level=info msg="Starting deadline monitor"
time="2022-10-21T09:24:05.268Z" level=info msg="Main container completed" error="<nil>"
time="2022-10-21T09:24:05.268Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2022-10-21T09:24:05.268Z" level=info msg="No output parameters"
time="2022-10-21T09:24:05.268Z" level=info msg="No output artifacts"
time="2022-10-21T09:24:05.268Z" level=info msg="stopping progress monitor (context done)" error="context canceled"
time="2022-10-21T09:24:05.268Z" level=info msg="Alloc=6768 TotalAlloc=12032 Sys=19410 NumGC=4 Goroutines=6"
time="2022-10-21T09:23:53.225Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2022-10-21T09:23:53.225Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=dev-acquisition podName=parallelism5rpvs-retry-backoff-4158946316 template="{\"name\":\"retry-backoff\",\"inputs\":{},\"outputs\":{},\"metadata\":{},\"container\":{\"name\":\"\",\"image\":\"python:alpine3.6\",\"command\":[\"python\",\"-c\"],\"args\":[\"import random; import sys; exit_code = random.choice([0, 1, 1]); sys.exit(exit_code)\"],\"resources\":{}},\"retryStrategy\":{\"limit\":\"10\",\"backoff\":{\"duration\":\"6h\",\"factor\":\"2\",\"maxDuration\":\"24h\"}},\"synchronization\":{\"semaphore\":{\"configMapKeyRef\":{\"name\":\"argo-parallel\",\"key\":\"tasks\"}}}}" version="&Version{Version:v3.4.1,BuildDate:2022-10-01T15:03:42Z,GitCommit:0546fef0b096d84c9e3362d2b241614e743ebe97,GitTag:v3.4.1,GitTreeState:clean,GoVersion:go1.18.6,Compiler:gc,Platform:linux/amd64,}"
time="2022-10-21T09:23:53.225Z" level=info msg="Starting deadline monitor"
time="2022-10-21T09:23:55.226Z" level=info msg="Main container completed" error="<nil>"
time="2022-10-21T09:23:55.226Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2022-10-21T09:23:55.226Z" level=info msg="No output parameters"
time="2022-10-21T09:23:55.226Z" level=info msg="No output artifacts"
time="2022-10-21T09:23:55.226Z" level=info msg="Deadline monitor stopped"
time="2022-10-21T09:23:55.226Z" level=info msg="stopping progress monitor (context done)" error="context canceled"
time="2022-10-21T09:23:55.226Z" level=info msg="Alloc=6621 TotalAlloc=12017 Sys=19410 NumGC=4 Goroutines=6"
@sarabala1979 sarabala1979 added the P3 Low priority label Oct 31, 2022
@stale
Copy link

stale bot commented Jan 22, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.

@stale stale bot added the problem/stale This has not had a response in some time label Jan 22, 2023
@agilgur5 agilgur5 added area/mutex-semaphore and removed problem/stale This has not had a response in some time labels Sep 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants