Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tasks don't appear to get killed correctly #1757

Closed
dalemyers opened this issue Aug 10, 2018 · 23 comments
Closed

Tasks don't appear to get killed correctly #1757

dalemyers opened this issue Aug 10, 2018 · 23 comments

Comments

@dalemyers
Copy link

Agent Version and Platform

2.138.4 on OS X

VSTS Type and Version

office.visualstudio.com

What's not working?

Tasks do not appear to be killed correctly.

We have a situation where we are running a Python script in a task. This Python scripts uses subprocess to call out to a command line application. If someone cancels the build while this is running, then it appears that the command line application gets killed, resources (such as temporary files) as wiped, but our Python script is left running for a little while longer, which causes it to get into an inconsistent state.

I say appears because there are no logs in the VSTS interface confirming this, however, the code path it goes through appears to be impossible unless it does go this route and just doesn't flush the output (which isn't unreasonable since there are problems with VSTS flushing output already).

So the flow of our task is something like this:

  1. Start running Python script
  2. That script starts running a command line process which logs to a file
  3. The build gets cancelled
  4. The command line process gets killed
  5. Any files, etc. get wipes (including the log file from step 2)
  6. The Python script detects the failure of the command and logs to stdout (also log files, but these get wiped too).
  7. The Python script tries to upload the log file from step 2 so we can see what happened, but the file doesn't exist so it throws an exception.
  8. We catch that exception elsewhere in the Python script and add a comment to the PR with this information.

So, basically, the Python script keeps running while everything else is wiped as far as I can tell, and this causes us to comment on the PR with failure information, despite the fact that the task is cancelled.

Side note: If we had a flag like IS_CANCELLED or something, I could read it and avoid posting to the PR if it's in that state and we encounter an error, but that's a little hacky.

Agent and Worker's Diagnostic Logs

Unavailable due to using hosted queue

Related Repositories

This may be better suited to the bash task in the tasks repo, but since I don't know where the issue is occurring, I figured starting general was better.

@TingluoHuang
Copy link
Contributor

@dalemyers during cancellation, the agent will send Ctrl-C to the child process, wait for 10 sec, if the process still running send another Ctrl-Break, wait for 5 sec, if the process still running, kill the process tree.

the agent only wiped AGENT_TEMPDIRECTORY at the end of the entire build/release job, so i am not sure what is delete the file.

Can you share a link to your build?

@dalemyers
Copy link
Author

We only see this after a build has been cancelled by pushing a new update to the branch (cancelling the old build on the PR and kicking off the new one). I'll need to wait until I see it, otherwise I don't know which builds this happened to.

@dalemyers
Copy link
Author

This seems to be happening less, with no changes on our end. Something weird is going on, I just don't know what.

@dalemyers
Copy link
Author

We're now seeing a new issue on top of this which is that when a task is stopped, most of the time the build just isn't cancelled at all now. It just keeps going. If it's a build policy though, then the PR can see that it has failed, and let's you requeue, but the original build keeps going.

@bryanmacfarlane
Copy link
Contributor

Ting asked for a link to the build above. Can you send some logs and / or _diag logs to us?

@dalemyers
Copy link
Author

Here's one: https://office.visualstudio.com/Outlook%20Mobile/_build/results?buildId=1071729&_a=summary&view=logs

I killed this when the "Build Release" task in the "Build Release Job" phase was at 4:30. 5 minutes is the default timeout, so worst case scenario it should be killed at 9:30. You can see that it has continued on doing it's own thing anyway. More than that, it just keeps going on to the next tasks.

This never used to be the case. It started a week or two before I opened this ticket.

@dalemyers
Copy link
Author

I'm still seeing this happening. When I kill a task, it's continuing to run. If I need the agent, I have to remote into the machine and kill whatever underlying process is running.

@ptrrssll
Copy link

ptrrssll commented Nov 7, 2018

We run the unified agent on top of the VS test agent. Deployments controlled by the Release process in TFS2017 Update 2. Intermittantly we have process kill failures that cause the whole test job to fail and not return any TRX test result file. The C# assemblies include Selenium and spawn the chromedriver and chrome in this example.

Problems start with the entry:
[2018-11-06 21:06:55Z INFO Worker] Cancellation/Shutdown message received.

We don't know why the TFS Release process would send this to the job???

Log is attached below:

...
[2018-11-06 17:00:43Z INFO JobServerQueue] Try to append 1 batches web console lines for record 'b580c21b-16b2-4f91-b944-43265063eb4e', success rate: 1/1.
[2018-11-06 21:06:55Z INFO Worker] Cancellation/Shutdown message received.
[2018-11-06 21:06:55Z INFO ExpressionManager] Evaluating: succeeded()
[2018-11-06 21:06:55Z INFO ExpressionManager] Result: False
[2018-11-06 21:06:55Z INFO StepsRunner] Cancel current running step.
[2018-11-06 21:06:55Z INFO ProcessInvokerWrapper] Sending CTRL_C to process 1204.
[2018-11-06 21:06:55Z INFO ProcessInvokerWrapper] Successfully send CTRL_C to process 1204.
[2018-11-06 21:06:55Z INFO ProcessInvokerWrapper] Waiting for process exit or 7.5 seconds after CTRL_C signal fired.
[2018-11-06 21:06:55Z INFO ProcessInvokerWrapper] Ignore Ctrl+C to current process.
[2018-11-06 21:06:55Z INFO ProcessInvokerWrapper] STDOUT/STDERR stream read finished.
[2018-11-06 21:06:55Z INFO ProcessInvokerWrapper] STDOUT/STDERR stream read finished.
[2018-11-06 21:06:55Z INFO ProcessInvokerWrapper] Finished process 1204 with exit code -1073741510, and elapsed time 10:06:35.0535079.
[2018-11-06 21:06:55Z INFO StepsRunner] Updated step result: SucceededWithIssues
[2018-11-06 21:06:55Z INFO StepsRunner] Update job result with current step result 'SucceededWithIssues'.
[2018-11-06 21:06:55Z INFO StepsRunner] Current state: job state = 'Canceled'
[2018-11-06 21:06:55Z INFO JobRunner] Total accessible running process: 76.
[2018-11-06 21:06:55Z INFO JobRunner] Inspecting process environment variables. PID: 1164 (conhost)
[2018-11-06 21:06:55Z INFO JobRunner] Terminate orphan process: pid (1164) (conhost)
[2018-11-06 21:06:55Z INFO JobRunner] Inspecting process environment variables. PID: 5892 (conhost)
[2018-11-06 21:06:55Z INFO JobRunner] Terminate orphan process: pid (5892) (conhost)
[2018-11-06 21:06:55Z INFO JobRunner] Inspecting process environment variables. PID: 4948 (chromedriver)
[2018-11-06 21:06:55Z WARN JobRunner] Ignore exception during read process environment variables: Only part of a ReadProcessMemory or WriteProcessMemory request was completed
[2018-11-06 21:06:56Z INFO JobRunner] Inspecting process environment variables. PID: 3868 (chrome)
[2018-11-06 21:06:56Z INFO JobRunner] Terminate orphan process: pid (3868) (chrome)
[2018-11-06 21:06:56Z INFO JobRunner] Inspecting process environment variables. PID: 6764 (chrome)
[2018-11-06 21:06:56Z INFO JobRunner] Terminate orphan process: pid (6764) (chrome)
[2018-11-06 21:06:56Z INFO JobRunner] Inspecting process environment variables. PID: 3804 (chrome)
[2018-11-06 21:06:56Z INFO JobRunner] Terminate orphan process: pid (3804) (chrome)
[2018-11-06 21:06:56Z INFO JobRunner] Inspecting process environment variables. PID: 4816 (chrome)
[2018-11-06 21:06:56Z INFO JobRunner] Terminate orphan process: pid (4816) (chrome)
[2018-11-06 21:06:56Z ERR JobRunner] Catch exception during orphan process cleanup.
[2018-11-06 21:06:56Z ERR JobRunner] System.ComponentModel.Win32Exception (5): Access is denied
at System.Diagnostics.Process.Kill()
at Microsoft.VisualStudio.Services.Agent.Worker.JobRunner.RunAsync(AgentJobRequestMessage message, CancellationToken jobRequestCancellationToken)
[2018-11-06 21:06:56Z INFO JobRunner] Inspecting process environment variables. PID: 4192 (chrome)
[2018-11-06 21:06:56Z WARN JobRunner] Ignore exception during read process environment variables: Cannot process request because the process (4192) has exited.
[2018-11-06 21:06:56Z INFO JobRunner] Inspecting process environment variables. PID: 5188 (chrome)
[2018-11-06 21:06:56Z WARN JobRunner] Ignore exception during read process environment variables: Cannot process request because the process (5188) has exited.
[2018-11-06 21:06:56Z INFO JobRunner] Inspecting process environment variables. PID: 1804 (chrome)
[2018-11-06 21:06:56Z WARN JobRunner] Ignore exception during read process environment variables: Cannot process request because the process (1804) has exited.
[2018-11-06 21:06:56Z INFO JobRunner] Inspecting process environment variables. PID: 272 (WmiApSrv)
[2018-11-06 21:06:56Z WARN JobRunner] Ignore exception during read process environment variables: Access is denied
[2018-11-06 21:06:56Z INFO JobRunner] Inspecting process environment variables. PID: 3744 (WmiPrvSE)
[2018-11-06 21:06:56Z WARN JobRunner] Ignore exception during read process environment variables: Access is denied
[2018-11-06 21:06:56Z INFO JobRunner] Job result after all job steps finish: Canceled
[2018-11-06 21:06:56Z INFO JobRunner] Completing the job execution context.
[2018-11-06 21:06:56Z INFO JobServerQueue] Try to upload 1 log files or attachments, success rate: 1/1.
[2018-11-06 21:06:56Z INFO JobRunner] Shutting down the job server queue.
[2018-11-06 21:06:56Z INFO JobServerQueue] Fire signal to shutdown all queues.
[2018-11-06 21:06:56Z INFO ProcessInvokerWrapper] Process exit successfully.
[2018-11-06 21:06:56Z INFO ProcessInvokerWrapper] Process cancelled successfully through Ctrl+C/SIGINT.
[2018-11-06 21:06:57Z INFO JobServerQueue] All queue process task stopped.
[2018-11-06 21:06:57Z INFO JobServerQueue] Try to append 1 batches web console lines for record 'b580c21b-16b2-4f91-b944-43265063eb4e', success rate: 1/1.
[2018-11-06 21:06:57Z INFO JobServerQueue] Try to append 1 batches web console lines for record 'cbc475da-ac2c-44d1-ab81-2ecff424c177', success rate: 1/1.
[2018-11-06 21:06:57Z INFO JobServerQueue] Web console line queue drained.
[2018-11-06 21:06:57Z INFO JobServerQueue] Uploading 1 files in one shot.
[2018-11-06 21:06:57Z INFO JobServerQueue] Try to upload 1 log files or attachments, success rate: 1/1.
[2018-11-06 21:06:57Z INFO JobServerQueue] File upload queue drained.
[2018-11-06 21:06:57Z INFO JobServerQueue] Timeline update queue drained.
[2018-11-06 21:06:57Z INFO JobServerQueue] All queue process tasks have been stopped, and all queues are drained.
[2018-11-06 21:06:57Z INFO TempDirectoryManager] Cleaning agent temp folder: E:\Agent_work_temp
[2018-11-06 21:06:57Z INFO JobRunner] Raising job completed event.

@TingluoHuang
Copy link
Contributor

@dalemyers make sure your phase condition require "succeed()"
ex: build_condition: and(and(succeed(), eq(dependencies.Determine_Spec.outputs['Set_Spec.CURRENT_SPEC'], 'WIP'), ne(variables['Build.Reason'], 'PullRequest')))

@dalemyers
Copy link
Author

Yes, our phases are set that way. The problem is that when a build is cancelled, the phase isn't cancelled. It just runs to completion. The following phases do not run though.

@TingluoHuang
Copy link
Contributor

@dalemyers make sure your condition is always contains and(succeed(), <your condition>.
i checked your definition, the phase condition you had will still evaluate to true on Cancellation which will cause the current phase still runs.

@dalemyers
Copy link
Author

I think I'm misunderstanding. We have 2 main phases that run during a build. The first is "Build and Test". It does the vast majority of the work, taking 20 minutes out of a 25 minute build. When we cancel a build a minute after this has started, it will run through to the end of the phase. This is despite having a 5 minute cancellation timeout.

The next phase is "Distribute". If the build is cancelled during the "Build and Test" phase, this phase does not run.

@TingluoHuang
Copy link
Contributor

@dalemyers what's the phase condition on your Build and Test phase?

@dalemyers
Copy link
Author

or(in(dependencies.Determine_Spec.outputs['Set_Spec.CURRENT_SPEC'], 'DEV', 'STAGE', 'DOGFOOD', 'PROD'), eq(variables['Build.Reason'], 'PullRequest'))

But it is already running when we click cancel.

@TingluoHuang
Copy link
Contributor

on build cancel, the system will re-evaluate phase condition and check whether it should let the phase continue running. Similar to task condition, you have an always run phase to do clean up even on cancellation.

and(succeeded(), or(in(dependencies.Determine_Spec.outputs['Set_Spec.CURRENT_SPEC'], 'DEV', 'STAGE', 'DOGFOOD', 'PROD'), eq(variables['Build.Reason'], 'PullRequest'))) should fix your problem.

@dalemyers
Copy link
Author

Oh. That wasn't clear at all. Did I miss that in the documentation?

@TingluoHuang
Copy link
Contributor

i can't find a clear doc either, @vtbassmatt for Doc feedback. :)

@vtbassmatt
Copy link
Member

@andyjlewis can you get this in the docs? Boiling it down: when a run is cancelled, we re-evaluate the condition on the running job. If the user has written a condition, our default condition (succeeded()) no longer applies, and they may be surprised that the job keeps running. They should and() their custom condition with succeeded() if they want the job to actually stop on cancellation. (This feature exists so that you can write custom cleanup steps for handling cancellation within the same job.)

@dalemyers
Copy link
Author

I've just added this and tested it and it's working correctly. I would never have suspected that was the issue in a million years. Thanks for the help!

@vtbassmatt I'm curious about this though:

This feature exists so that you can write custom cleanup steps for handling cancellation within the same job.

Is there any documentation on this?

@vtbassmatt
Copy link
Member

vtbassmatt commented Nov 13, 2018

Is there any documentation on this?

Not as such, and that's what I asked Andy to doc. The idea is you could write something like:

jobs:
- job: MyJob
  condition: successOrFailure()
  steps:
  - task: Foo@1
    displayName: A task that will need cleaning up
  - task: Bar@1
    displayName: A task I might want to cancel
  - task: FooCleanup@1
    condition: successOrFailure()

You want FooCleanup to run even if the job is cancelled. So, you opt the job and that task into staying alive even on failure.

@dalemyers
Copy link
Author

Ah, got it. Thanks!

@marionzr
Copy link

marionzr commented Apr 6, 2021

Hi. I'm facing a similar problem. Could someone check what could be wrong?

Bellow is the condition inside the Custom condition

AND (succeeded, MyCondition)
MyCondition => OR (EQ(variable[x], 'true'), NOT(CONTAINS(variable[y], 'a_value')))

and(succeeded(), or(eq(variables['IgnorePullRequestTags'], 'true'), not(contains(variables['PullRequestTags.Value'], 'TestsSkipProfessionalApi'))))

Even after pressing cancel the builds did not stop, even after the cancel timeout.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants