Tasks don't appear to get killed correctly #1757

dalemyers · 2018-08-10T10:36:18Z

Agent Version and Platform

2.138.4 on OS X

VSTS Type and Version

office.visualstudio.com

What's not working?

Tasks do not appear to be killed correctly.

We have a situation where we are running a Python script in a task. This Python script uses subprocess to call out to a command line application. If someone cancels the build while this is running, it appears that the command line application gets killed and resources (such as temporary files) are wiped, but our Python script is left running for a little while longer, which leaves it in an inconsistent state.

I say "appears" because there are no logs in the VSTS interface confirming this. However, the code path it goes through appears to be impossible unless it does go this route and just doesn't flush the output (which isn't unreasonable, since there are already problems with VSTS flushing output).

So the flow of our task is something like this:

1. Start running the Python script.
2. That script starts running a command line process which logs to a file.
3. The build gets cancelled.
4. The command line process gets killed.
5. Any files, etc. get wiped (including the log file from step 2).
6. The Python script detects the failure of the command and logs to stdout (also to log files, but these get wiped too).
7. The Python script tries to upload the log file from step 2 so we can see what happened, but the file doesn't exist, so it throws an exception.
8. We catch that exception elsewhere in the Python script and add a comment to the PR with this information.

So, basically, the Python script keeps running while everything else is wiped as far as I can tell, and this causes us to comment on the PR with failure information, despite the fact that the task is cancelled.

Side note: If we had a flag like IS_CANCELLED or something, I could read it and avoid posting to the PR if it's in that state and we encounter an error, but that's a little hacky.

Agent and Worker's Diagnostic Logs

Unavailable due to using hosted queue

Related Repositories

This may be better suited to the bash task in the tasks repo, but since I don't know where the issue is occurring, I figured starting general was better.


TingluoHuang · 2018-08-10T17:46:37Z

@dalemyers during cancellation, the agent sends Ctrl-C to the child process and waits 10 seconds. If the process is still running, it sends Ctrl-Break and waits 5 seconds. If the process is still running after that, it kills the process tree.

The agent only wipes AGENT_TEMPDIRECTORY at the end of the entire build/release job, so I am not sure what is deleting the file.

Can you share a link to your build?

dalemyers · 2018-08-10T17:48:57Z

We only see this after a build has been cancelled by pushing a new update to the branch (cancelling the old build on the PR and kicking off the new one). I'll need to wait until I see it, otherwise I don't know which builds this happened to.

dalemyers · 2018-08-24T09:31:35Z

This seems to be happening less, with no changes on our end. Something weird is going on, I just don't know what.

dalemyers · 2018-09-10T15:32:59Z

We're now seeing a new issue on top of this: when a task is stopped, most of the time the build isn't cancelled at all. It just keeps going. If it's a build policy, though, the PR can see that it has failed and lets you requeue, but the original build keeps going.

bryanmacfarlane · 2018-09-21T02:37:58Z

Ting asked for a link to the build above. Can you send some logs and / or _diag logs to us?

dalemyers · 2018-09-21T09:08:54Z

Here's one: https://office.visualstudio.com/Outlook%20Mobile/_build/results?buildId=1071729&_a=summary&view=logs

I killed this when the "Build Release" task in the "Build Release Job" phase was at 4:30. 5 minutes is the default timeout, so worst case scenario it should be killed at 9:30. You can see that it has continued on doing its own thing anyway. More than that, it just keeps going on to the next tasks.

This never used to be the case. It started a week or two before I opened this ticket.

dalemyers · 2018-11-05T13:03:27Z

I'm still seeing this happening. When I kill a task, it's continuing to run. If I need the agent, I have to remote into the machine and kill whatever underlying process is running.

ptrrssll · 2018-11-07T00:49:41Z

We run the unified agent on top of the VS test agent. Deployments are controlled by the Release process in TFS2017 Update 2. Intermittently we have process kill failures that cause the whole test job to fail and not return any TRX test result file. The C# assemblies include Selenium and spawn chromedriver and chrome in this example.

Problems start with the entry:
[2018-11-06 21:06:55Z INFO Worker] Cancellation/Shutdown message received.

We don't know why the TFS Release process would send this to the job???

Log is attached below:

...
[2018-11-06 17:00:43Z INFO JobServerQueue] Try to append 1 batches web console lines for record 'b580c21b-16b2-4f91-b944-43265063eb4e', success rate: 1/1.
[2018-11-06 21:06:55Z INFO Worker] Cancellation/Shutdown message received.
[2018-11-06 21:06:55Z INFO ExpressionManager] Evaluating: succeeded()
[2018-11-06 21:06:55Z INFO ExpressionManager] Result: False
[2018-11-06 21:06:55Z INFO StepsRunner] Cancel current running step.
[2018-11-06 21:06:55Z INFO ProcessInvokerWrapper] Sending CTRL_C to process 1204.
[2018-11-06 21:06:55Z INFO ProcessInvokerWrapper] Successfully send CTRL_C to process 1204.
[2018-11-06 21:06:55Z INFO ProcessInvokerWrapper] Waiting for process exit or 7.5 seconds after CTRL_C signal fired.
[2018-11-06 21:06:55Z INFO ProcessInvokerWrapper] Ignore Ctrl+C to current process.
[2018-11-06 21:06:55Z INFO ProcessInvokerWrapper] STDOUT/STDERR stream read finished.
[2018-11-06 21:06:55Z INFO ProcessInvokerWrapper] STDOUT/STDERR stream read finished.
[2018-11-06 21:06:55Z INFO ProcessInvokerWrapper] Finished process 1204 with exit code -1073741510, and elapsed time 10:06:35.0535079.
[2018-11-06 21:06:55Z INFO StepsRunner] Updated step result: SucceededWithIssues
[2018-11-06 21:06:55Z INFO StepsRunner] Update job result with current step result 'SucceededWithIssues'.
[2018-11-06 21:06:55Z INFO StepsRunner] Current state: job state = 'Canceled'
[2018-11-06 21:06:55Z INFO JobRunner] Total accessible running process: 76.
[2018-11-06 21:06:55Z INFO JobRunner] Inspecting process environment variables. PID: 1164 (conhost)
[2018-11-06 21:06:55Z INFO JobRunner] Terminate orphan process: pid (1164) (conhost)
[2018-11-06 21:06:55Z INFO JobRunner] Inspecting process environment variables. PID: 5892 (conhost)
[2018-11-06 21:06:55Z INFO JobRunner] Terminate orphan process: pid (5892) (conhost)
[2018-11-06 21:06:55Z INFO JobRunner] Inspecting process environment variables. PID: 4948 (chromedriver)
[2018-11-06 21:06:55Z WARN JobRunner] Ignore exception during read process environment variables: Only part of a ReadProcessMemory or WriteProcessMemory request was completed
[2018-11-06 21:06:56Z INFO JobRunner] Inspecting process environment variables. PID: 3868 (chrome)
[2018-11-06 21:06:56Z INFO JobRunner] Terminate orphan process: pid (3868) (chrome)
[2018-11-06 21:06:56Z INFO JobRunner] Inspecting process environment variables. PID: 6764 (chrome)
[2018-11-06 21:06:56Z INFO JobRunner] Terminate orphan process: pid (6764) (chrome)
[2018-11-06 21:06:56Z INFO JobRunner] Inspecting process environment variables. PID: 3804 (chrome)
[2018-11-06 21:06:56Z INFO JobRunner] Terminate orphan process: pid (3804) (chrome)
[2018-11-06 21:06:56Z INFO JobRunner] Inspecting process environment variables. PID: 4816 (chrome)
[2018-11-06 21:06:56Z INFO JobRunner] Terminate orphan process: pid (4816) (chrome)
[2018-11-06 21:06:56Z ERR JobRunner] Catch exception during orphan process cleanup.
[2018-11-06 21:06:56Z ERR JobRunner] System.ComponentModel.Win32Exception (5): Access is denied
at System.Diagnostics.Process.Kill()
at Microsoft.VisualStudio.Services.Agent.Worker.JobRunner.RunAsync(AgentJobRequestMessage message, CancellationToken jobRequestCancellationToken)
[2018-11-06 21:06:56Z INFO JobRunner] Inspecting process environment variables. PID: 4192 (chrome)
[2018-11-06 21:06:56Z WARN JobRunner] Ignore exception during read process environment variables: Cannot process request because the process (4192) has exited.
[2018-11-06 21:06:56Z INFO JobRunner] Inspecting process environment variables. PID: 5188 (chrome)
[2018-11-06 21:06:56Z WARN JobRunner] Ignore exception during read process environment variables: Cannot process request because the process (5188) has exited.
[2018-11-06 21:06:56Z INFO JobRunner] Inspecting process environment variables. PID: 1804 (chrome)
[2018-11-06 21:06:56Z WARN JobRunner] Ignore exception during read process environment variables: Cannot process request because the process (1804) has exited.
[2018-11-06 21:06:56Z INFO JobRunner] Inspecting process environment variables. PID: 272 (WmiApSrv)
[2018-11-06 21:06:56Z WARN JobRunner] Ignore exception during read process environment variables: Access is denied
[2018-11-06 21:06:56Z INFO JobRunner] Inspecting process environment variables. PID: 3744 (WmiPrvSE)
[2018-11-06 21:06:56Z WARN JobRunner] Ignore exception during read process environment variables: Access is denied
[2018-11-06 21:06:56Z INFO JobRunner] Job result after all job steps finish: Canceled
[2018-11-06 21:06:56Z INFO JobRunner] Completing the job execution context.
[2018-11-06 21:06:56Z INFO JobServerQueue] Try to upload 1 log files or attachments, success rate: 1/1.
[2018-11-06 21:06:56Z INFO JobRunner] Shutting down the job server queue.
[2018-11-06 21:06:56Z INFO JobServerQueue] Fire signal to shutdown all queues.
[2018-11-06 21:06:56Z INFO ProcessInvokerWrapper] Process exit successfully.
[2018-11-06 21:06:56Z INFO ProcessInvokerWrapper] Process cancelled successfully through Ctrl+C/SIGINT.
[2018-11-06 21:06:57Z INFO JobServerQueue] All queue process task stopped.
[2018-11-06 21:06:57Z INFO JobServerQueue] Try to append 1 batches web console lines for record 'b580c21b-16b2-4f91-b944-43265063eb4e', success rate: 1/1.
[2018-11-06 21:06:57Z INFO JobServerQueue] Try to append 1 batches web console lines for record 'cbc475da-ac2c-44d1-ab81-2ecff424c177', success rate: 1/1.
[2018-11-06 21:06:57Z INFO JobServerQueue] Web console line queue drained.
[2018-11-06 21:06:57Z INFO JobServerQueue] Uploading 1 files in one shot.
[2018-11-06 21:06:57Z INFO JobServerQueue] Try to upload 1 log files or attachments, success rate: 1/1.
[2018-11-06 21:06:57Z INFO JobServerQueue] File upload queue drained.
[2018-11-06 21:06:57Z INFO JobServerQueue] Timeline update queue drained.
[2018-11-06 21:06:57Z INFO JobServerQueue] All queue process tasks have been stopped, and all queues are drained.
[2018-11-06 21:06:57Z INFO TempDirectoryManager] Cleaning agent temp folder: E:\Agent_work_temp
[2018-11-06 21:06:57Z INFO JobRunner] Raising job completed event.
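
The exit code on the "Finished process 1204" line in the log above is easier to read as an unsigned 32-bit value. A small Python sketch of the conversion follows; the helper name `to_ntstatus_hex` and the one-entry lookup table are illustrative only. The value 0xC000013A is the Windows NTSTATUS code STATUS_CONTROL_C_EXIT, which matches the CTRL_C the agent reports sending just before.

```python
# Illustrative lookup table; only the code seen in the log above is listed.
KNOWN_NTSTATUS = {
    0xC000013A: "STATUS_CONTROL_C_EXIT",
}


def to_ntstatus_hex(exit_code):
    """Format a signed 32-bit process exit code as unsigned hexadecimal."""
    return f"0x{exit_code & 0xFFFFFFFF:08X}"


def describe_exit_code(exit_code):
    """Return the hex form plus a name when the code is in the table."""
    name = KNOWN_NTSTATUS.get(exit_code & 0xFFFFFFFF, "unknown")
    return f"{to_ntstatus_hex(exit_code)} ({name})"
```

For the value in the log, `describe_exit_code(-1073741510)` gives `0xC000013A (STATUS_CONTROL_C_EXIT)`.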

TingluoHuang · 2018-11-09T21:15:01Z

@dalemyers make sure your phase condition requires "succeeded()"
ex: build_condition: and(succeeded(), eq(dependencies.Determine_Spec.outputs['Set_Spec.CURRENT_SPEC'], 'WIP'), ne(variables['Build.Reason'], 'PullRequest'))

TingluoHuang · 2018-11-09T21:20:05Z

https://docs.microsoft.com/en-us/azure/devops/pipelines/process/conditions?view=vsts&tabs=yaml#examples

dalemyers · 2018-11-10T13:08:22Z

Yes, our phases are set that way. The problem is that when a build is cancelled, the phase isn't cancelled. It just runs to completion. The following phases do not run though.

TingluoHuang · 2018-11-11T01:28:43Z

@dalemyers make sure your condition always contains and(succeeded(), <your condition>).
I checked your definition; the phase condition you have still evaluates to true on cancellation, which causes the current phase to keep running.

dalemyers · 2018-11-12T10:43:09Z

I think I'm misunderstanding. We have 2 main phases that run during a build. The first is "Build and Test". It does the vast majority of the work, taking 20 minutes out of a 25 minute build. When we cancel a build a minute after this has started, it will run through to the end of the phase. This is despite having a 5 minute cancellation timeout.

The next phase is "Distribute". If the build is cancelled during the "Build and Test" phase, this phase does not run.

TingluoHuang · 2018-11-12T13:41:51Z

@dalemyers what's the phase condition on your Build and Test phase?

dalemyers · 2018-11-12T14:39:21Z

or(in(dependencies.Determine_Spec.outputs['Set_Spec.CURRENT_SPEC'], 'DEV', 'STAGE', 'DOGFOOD', 'PROD'), eq(variables['Build.Reason'], 'PullRequest'))

But it is already running when we click cancel.

TingluoHuang · 2018-11-12T15:09:34Z

On build cancel, the system re-evaluates the phase condition and checks whether it should let the phase continue running. Similar to a task condition, this lets you have an always-run phase to do cleanup even on cancellation.

and(succeeded(), or(in(dependencies.Determine_Spec.outputs['Set_Spec.CURRENT_SPEC'], 'DEV', 'STAGE', 'DOGFOOD', 'PROD'), eq(variables['Build.Reason'], 'PullRequest'))) should fix your problem.

dalemyers · 2018-11-12T15:27:37Z

Oh. That wasn't clear at all. Did I miss that in the documentation?

TingluoHuang · 2018-11-12T15:52:09Z

i can't find a clear doc either, @vtbassmatt for Doc feedback. :)

vtbassmatt · 2018-11-12T22:18:42Z

@andyjlewis can you get this in the docs? Boiling it down: when a run is cancelled, we re-evaluate the condition on the running job. If the user has written a condition, our default condition (succeeded()) no longer applies, and they may be surprised that the job keeps running. They should and() their custom condition with succeeded() if they want the job to actually stop on cancellation. (This feature exists so that you can write custom cleanup steps for handling cancellation within the same job.)
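
The behaviour boiled down above can be illustrated with a toy model in Python. This is not the pipeline's real expression evaluator: `RunState`, `succeeded`, `custom_condition`, and `custom_with_succeeded` are hypothetical stand-ins that only mirror the shape of the two conditions quoted earlier in the thread, under the assumption that succeeded() becomes false once the run is cancelled.

```python
from dataclasses import dataclass

# Spec names taken from the condition quoted earlier in the thread.
RELEASE_SPECS = {"DEV", "STAGE", "DOGFOOD", "PROD"}


@dataclass
class RunState:
    cancelled: bool = False
    earlier_failure: bool = False
    current_spec: str = ""
    build_reason: str = ""


def succeeded(state):
    # Toy stand-in: false once the run is cancelled or something failed.
    return not state.cancelled and not state.earlier_failure


def custom_condition(state):
    # Shape of the original condition: or(in(spec, ...), eq(reason, 'PullRequest')).
    return state.current_spec in RELEASE_SPECS or state.build_reason == "PullRequest"


def custom_with_succeeded(state):
    # Shape of the suggested fix: and(succeeded(), <custom condition>).
    return succeeded(state) and custom_condition(state)


state = RunState(build_reason="PullRequest")
before = (custom_condition(state), custom_with_succeeded(state))
state.cancelled = True  # the user clicks cancel; the condition is re-evaluated
after = (custom_condition(state), custom_with_succeeded(state))
```

In this model both conditions hold before the cancel, but afterwards only the condition without succeeded() still holds, which corresponds to the job that keeps running.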

dalemyers · 2018-11-13T10:24:59Z

I've just added this and tested it and it's working correctly. I would never have suspected that was the issue in a million years. Thanks for the help!

@vtbassmatt I'm curious about this though:

This feature exists so that you can write custom cleanup steps for handling cancellation within the same job.

Is there any documentation on this?

vtbassmatt · 2018-11-13T11:03:08Z

"Is there any documentation on this?"

Not as such, and that's what I asked Andy to doc. The idea is you could write something like:

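jobs:
- job: MyJob
  # always() keeps the job eligible to run even after the run is cancelled
  condition: always()
  steps:
  - task: Foo@1
    displayName: A task that will need cleaning up
  - task: Bar@1
    displayName: A task I might want to cancel
  # The steps above keep the default condition, so they stop on cancellation;
  # only the cleanup step opts in to running regardless.
  - task: FooCleanup@1
    condition: always()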

You want FooCleanup to run even if the job is cancelled. So, you opt the job and that task into staying alive even on cancellation.

dalemyers · 2018-11-13T11:32:58Z

Ah, got it. Thanks!

marionzr · 2021-04-06T17:43:37Z

Hi. I'm facing a similar problem. Could someone check what could be wrong?

Below is the condition inside the Custom condition

AND (succeeded, MyCondition)
MyCondition => OR (EQ(variable[x], 'true'), NOT(CONTAINS(variable[y], 'a_value')))

and(succeeded(), or(eq(variables['IgnorePullRequestTags'], 'true'), not(contains(variables['PullRequestTags.Value'], 'TestsSkipProfessionalApi'))))

Even after pressing cancel the builds did not stop, even after the cancel timeout.

bryanmacfarlane closed this as completed Feb 15, 2019

DaRosenberg mentioned this issue Aug 13, 2022

BashV3 task incorrectly sends SIGTERM to child process instead of SIGINT microsoft/azure-pipelines-tasks#16731

Open
