Tasks don't appear to get killed correctly #1757
Comments
@dalemyers during cancellation, the agent sends Ctrl-C to the child process and waits 10 seconds. If the process is still running, it sends Ctrl-Break and waits 5 seconds. If the process is still running after that, it kills the process tree. The agent only wipes AGENT_TEMPDIRECTORY at the end of the entire build/release job, so I am not sure what is deleting the file. Can you share a link to your build? |
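For readers who want to reproduce the escalation described above outside the agent, here is a minimal Python sketch. It is not the agent's implementation. It assumes a POSIX machine (such as the OS X agent in this report), uses SIGTERM as a stand-in for Ctrl-Break (which only exists on Windows), and the helper names `start_in_own_group` and `cancel_process` are made up for illustration.

```python
import os
import signal
import subprocess


def start_in_own_group(args, **kwargs):
    """Start a child as leader of a new session/process group (POSIX only)."""
    return subprocess.Popen(args, start_new_session=True, **kwargs)


def cancel_process(proc, first_wait=10.0, second_wait=5.0):
    """Escalate until proc exits; return the name of the stage that ended it.

    proc must have been started with start_in_own_group, because the final
    stage kills the whole process group whose id equals proc.pid.
    """
    if proc.poll() is not None:
        return "already-exited"
    # Stage 1: polite interrupt (Ctrl-C). Stage 2: SIGTERM, standing in
    # for Ctrl-Break, which has no POSIX equivalent.
    for sig, wait in ((signal.SIGINT, first_wait), (signal.SIGTERM, second_wait)):
        proc.send_signal(sig)
        try:
            proc.wait(timeout=wait)
            return sig.name
        except subprocess.TimeoutExpired:
            pass  # still running, so escalate
    # Stage 3: kill the child and everything in its process group.
    try:
        os.killpg(proc.pid, signal.SIGKILL)
    except ProcessLookupError:
        pass  # the group vanished between the timeout and the kill
    proc.wait()
    return "SIGKILL"
```

A child that ignores the first two signals only dies at the last stage, so a script that needs to clean up has at most the two wait periods to do it.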
We only see this after a build has been cancelled by pushing a new update to the branch (cancelling the old build on the PR and kicking off the new one). I'll need to wait until I see it, otherwise I don't know which builds this happened to. |
This seems to be happening less, with no changes on our end. Something weird is going on, I just don't know what. |
We're now seeing a new issue on top of this: when a task is stopped, most of the time the build isn't cancelled at all. It just keeps going. If it's a build policy, the PR can see that it has failed and lets you requeue, but the original build keeps going. |
Ting asked for a link to the build above. Can you send some logs and / or _diag logs to us? |
Here's one: https://office.visualstudio.com/Outlook%20Mobile/_build/results?buildId=1071729&_a=summary&view=logs I killed this when the "Build Release" task in the "Build Release Job" phase was at 4:30. 5 minutes is the default timeout, so worst case scenario it should be killed at 9:30. You can see that it has continued on doing its own thing anyway. More than that, it just keeps going on to the next tasks. This never used to be the case. It started a week or two before I opened this ticket. |
I'm still seeing this happening. When I kill a task, it's continuing to run. If I need the agent, I have to remote into the machine and kill whatever underlying process is running. |
We run the unified agent on top of the VS test agent. Deployments are controlled by the Release process in TFS2017 Update 2. Intermittently we have process kill failures that cause the whole test job to fail and not return any TRX test result file. The C# assemblies include Selenium and spawn chromedriver and Chrome in this example. Problems start with the entry: We don't know why the TFS Release process would send this to the job??? Log is attached below: ... |
@dalemyers make sure your phase condition requires succeeded() |
Yes, our phases are set that way. The problem is that when a build is cancelled, the phase isn't cancelled. It just runs to completion. The following phases do not run though. |
@dalemyers make sure your condition always contains |
I think I'm misunderstanding. We have 2 main phases that run during a build. The first is "Build and Test". It does the vast majority of the work, taking 20 minutes out of a 25 minute build. When we cancel a build a minute after this has started, it will run through to the end of the phase. This is despite having a 5 minute cancellation timeout. The next phase is "Distribute". If the build is cancelled during the "Build and Test" phase, this phase does not run. |
@dalemyers what's the phase condition on your |
But it is already running when we click cancel. |
On build cancel, the system re-evaluates the phase condition and checks whether it should let the phase continue running. Similar to a task condition, you can have an always-run phase that does cleanup even on cancellation.
|
Oh. That wasn't clear at all. Did I miss that in the documentation? |
I can't find a clear doc either. @vtbassmatt for doc feedback. :) |
@andyjlewis can you get this in the docs? Boiling it down: when a run is cancelled, we re-evaluate the condition on the running job. If the user has written a condition, our default condition ( |
I've just added this and tested it and it's working correctly. I would never have suspected that was the issue in a million years. Thanks for the help! @vtbassmatt I'm curious about this though:
Is there any documentation on this? |
Not as such, and that's what I asked Andy to doc. The idea is you could write something like:
jobs:
- job: MyJob
  condition: successOrFailure()
  steps:
  - task: Foo@1
    displayName: A task that will need cleaning up
  - task: Bar@1
    displayName: A task I might want to cancel
  - task: FooCleanup@1
    condition: successOrFailure()
You want FooCleanup to run even if the job is cancelled. So, you opt the job and that task into staying alive even on failure. |
Ah, got it. Thanks! |
Hi. I'm facing a similar problem. Could someone check what could be wrong? Below is the condition inside the Custom condition, which has the form AND (succeeded, MyCondition): and(succeeded(), or(eq(variables['IgnorePullRequestTags'], 'true'), not(contains(variables['PullRequestTags.Value'], 'TestsSkipProfessionalApi')))) Even after pressing cancel, the builds did not stop, even after the cancel timeout. |
Agent Version and Platform
2.138.4 on OS X
VSTS Type and Version
office.visualstudio.com
What's not working?
Tasks do not appear to be killed correctly.
We have a situation where we are running a Python script in a task. This Python script uses subprocess to call out to a command line application. If someone cancels the build while this is running, then it appears that the command line application gets killed and resources (such as temporary files) are wiped, but our Python script is left running for a little while longer, which causes it to get into an inconsistent state. I say "appears" because there are no logs in the VSTS interface confirming this. However, the code path it goes through appears to be impossible unless it does go this route and just doesn't flush the output (which isn't unreasonable, since there are problems with VSTS flushing output already).
So the flow of our task is something like this:
So, basically, the Python script keeps running while everything else is wiped as far as I can tell, and this causes us to comment on the PR with failure information, despite the fact that the task is cancelled.
Side note: If we had a flag like IS_CANCELLED or something, I could read it and avoid posting to the PR if it's in that state and we encounter an error, but that's a little hacky.
Agent and Worker's Diagnostic Logs
Unavailable due to using hosted queue
Related Repositories
This may be better suited to the bash task in the tasks repo, but since I don't know where the issue is occurring, I figured starting general was better.