Cannot cancel the build while waiting for a lock #70

Open
jfharden opened this issue May 13, 2024 · 4 comments · Fixed by alphagov/pool-resource#3
@jfharden

jfharden commented May 13, 2024

Describe the bug

When a task tries to claim a lock that is already claimed, and is therefore waiting for the lock to be released, it does not respond to cancel requests: a job timeout, a cancel in the Concourse UI, or a cancel issued via the CLI.

The only way I have found to terminate the job is to hijack the container and kill the ssh-agent process running in it.

Reproduction steps

  1. Using concourse 7.11.2 with containerd runtime
  2. Run 2 jobs trying to claim the same lock using private key auth
  3. Once a lock has been acquired by 1 job, try to cancel the second job in the concourse UI
  4. Observe that the claim task job never gets terminated and the job doesn't fail

A second scenario, which I think is very likely related:

  1. Using concourse 7.11.2 with containerd runtime
  2. Configure 2 jobs which take 20 minutes to complete, and have a claim lock step with a 10 minute timeout. The locks must be configured to use private key auth
  3. Run the 2 jobs
  4. Once a lock has been acquired and held for over 10 minutes, notice that the timeout does not apply
  5. Wait for the running job with the lock to complete
  6. Notice the job still waiting for the lock now says claimed, but hangs forever
  7. Hijack the job which is hanging. Notice there are no tasks running in the container other than your hijack session and an ssh-agent. If you kill the ssh-agent, the job will immediately enter "Timeout reached" status and the lock is now deadlocked.

Expected behavior

  1. Clicking to cancel the job in the concourse UI actually cancels it
  2. Timeouts are respected

Additional context

No response

@jfharden jfharden added the bug label May 13, 2024
@jfharden
Author

I've been doing some debugging, here's what I've found so far:

When the container first launches and the lock is already claimed, the task is in a waiting state and these processes are visible (in all these examples I'm purposefully leaving out the bash and ps commands from my hijacked session):

UID          PID    PPID  C STIME TTY          TIME CMD
root           1       0  0 11:29 ?        00:00:00 /tmp/gdn-init
root           8       0  0 11:29 ?        00:00:00 /bin/sh /opt/resource/out /tmp/build/put
root          20       1  0 11:29 ?        00:00:00 ssh-agent
root          33       8  0 11:29 ?        00:00:00 /opt/go/out /tmp/build/put

If you attempt to abort the task via the concourse UI, only /bin/sh /opt/resource/out /tmp/build/put is killed (PID 8 above):

root@fb6bdc04-1dbc-4d75-7aec-24e2dc70895e:/tmp/build/put# ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
root           1       0  0 11:29 ?        00:00:00 /tmp/gdn-init
root          20       1  0 11:29 ?        00:00:00 ssh-agent
root          33       1  0 11:29 ?        00:00:00 /opt/go/out /tmp/build/put

If you just allow this to run, it will eventually actually claim the lock, but the job in concourse will hang forever printing no more output:
Screenshot 2024-05-15 at 12 32 13

However the /opt/go/out process will have terminated:

root@fb6bdc04-1dbc-4d75-7aec-24e2dc70895e:/tmp/build/put# ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
root           1       0  0 11:29 ?        00:00:00 /tmp/gdn-init
root          20       1  0 11:29 ?        00:00:00 ssh-agent

When I kill the ssh-agent (kill 20) the task actually finishes and I'm only left with gdn-init:

root@fb6bdc04-1dbc-4d75-7aec-24e2dc70895e:/tmp/build/put# ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
root           1       0  0 11:29 ?        00:00:00 /tmp/gdn-init

Screenshot 2024-05-15 at 12 36 13

However, if the task managed to claim the lock, the lock is now deadlocked until you intervene.

@jfharden
Author

jfharden commented May 15, 2024

It feels like this trap isn't working https://github.com/concourse/pool-resource/blob/master/assets/common.sh#L12

I don't see why it wouldn't and I can't replicate it locally.
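For reference, one common way to make a wrapper script pass a termination signal on to its child is to run the child in the background, trap the signal, and forward it. The sketch below illustrates that general pattern only; it is not the contents of common.sh and not the change in the linked PR. `sleep 30` is a stand-in for the long-running process, and every name is hypothetical.

```shell
#!/bin/sh
# Sketch only: forward TERM from a wrapper shell to its child.
pidfile=$(mktemp)

sh -c '
  sleep 30 &                      # stand-in for the long-running process
  child=$!
  # On TERM: forward it, reap the child, exit with the conventional 128+15.
  trap "kill -TERM $child 2>/dev/null; wait $child; exit 143" TERM
  echo "$child" > "$1"            # publish the PID only after the trap is set
  wait "$child"                   # `wait` is interruptible, so the trap can run
' wrapper "$pidfile" &
wrapper_pid=$!

# Wait until the wrapper has installed its trap and recorded the child's PID.
while [ ! -s "$pidfile" ]; do sleep 1; done
child_pid=$(cat "$pidfile")

# Signal only the wrapper and collect its exit status.
kill -TERM "$wrapper_pid"
wrapper_status=0
wait "$wrapper_pid" 2>/dev/null || wrapper_status=$?

# The wrapper reaped the child inside its trap, so the child is gone.
if kill -0 "$child_pid" 2>/dev/null; then
  child_alive=yes
else
  child_alive=no
fi
echo "wrapper exit status: $wrapper_status, child alive: $child_alive"
rm -f "$pidfile"
```

The background-plus-`wait` shape matters: POSIX shells defer a trap until a foreground command has finished, so a trap set around a foreground child does not run while that child is still blocking.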

I've set up a pipeline using a lock pool with username & password auth (GitHub Personal Access Token) and the behaviour changes slightly.

While the job is waiting for the lock you still can't abort it; it continues waiting. However, once it claims the lock (assuming it eventually can), the job does then terminate as interrupted:
Screenshot 2024-05-15 at 16 01 31

The expected processes are running:

root@13059de6-1631-4b23-5a67-93d57bac4850:/tmp/build/put# ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
root           1       0  0 15:03 ?        00:00:00 /tmp/gdn-init
root           7       0  0 15:03 ?        00:00:00 /bin/sh /opt/resource/out /tmp/build/put
root          25       7  0 15:03 ?        00:00:00 /opt/go/out /tmp/build/put

If I try to cancel the job through the UI again, the shell script exits (PID 7 above) but the Go process is still running:

root@13059de6-1631-4b23-5a67-93d57bac4850:/tmp/build/put# ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
root           1       0  0 15:03 ?        00:00:00 /tmp/gdn-init
root          25       1  0 15:03 ?        00:00:00 /opt/go/out /tmp/build/put

If in the hijacked session I kill the go process then the task correctly exits (kill 25 in the above example)
Screenshot 2024-05-15 at 16 07 19
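Since the task exits as soon as the Go process itself is killed, another general pattern is for the wrapper to `exec` the long-running program, so that the PID being signalled is that program. This is a sketch under assumptions, not a proposed patch: `sleep 30` stands in for /opt/go/out and all names are hypothetical.

```shell
#!/bin/sh
# Sketch only: `exec` replaces the wrapper shell with the child process.
pidfile=$(mktemp)

# Hypothetical wrapper: records its own PID, then becomes `sleep 30`.
sh -c 'echo $$ > "$1"; exec sleep 30' wrapper "$pidfile" &
wrapper_pid=$!

# Wait until the wrapper has recorded its PID.
while [ ! -s "$pidfile" ]; do sleep 1; done
exec_pid=$(cat "$pidfile")

# The signal now lands on the long-running process itself.
kill -TERM "$wrapper_pid"
exec_status=0
wait "$wrapper_pid" 2>/dev/null || exec_status=$?

if kill -0 "$exec_pid" 2>/dev/null; then
  still_alive=yes
else
  still_alive=no
fi
echo "same pid: $exec_pid = $wrapper_pid, status: $exec_status, alive: $still_alive"
rm -f "$pidfile"
```

The trade-off is that `exec` discards the shell, so any cleanup trap the script had set no longer runs once the program has taken over.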

@jfharden
Author

Sorry, this shouldn't have been closed. I merged a PR into a fork.
