
Hosted agent "lost communication with the server" #2261

Closed
letmaik opened this issue May 19, 2019 · 37 comments
@letmaik
Member

letmaik commented May 19, 2019

Agent Version and Platform

Version of your agent? '2.150.3' according to the "Initialize job" task

OS of the machine running the agent? OSX, Windows

Azure DevOps Type and Version

dev.azure.com

What's not working?

https://dev.azure.com/WRF-CMake/WRF/_build/results?buildId=324
https://dev.azure.com/WRF-CMake/WRF/_build/results?buildId=373
https://dev.azure.com/WRF-CMake/WRF/_build/results?buildId=374

A job stopped with:

##[Error 1]
The agent: Hosted macOS High Sierra 5 lost communication with the server. Verify the machine is running and has a healthy network connection. For more information, see: https://go.microsoft.com/fwlink/?linkid=846610

##[Error 1]
The agent: Hosted Windows 2019 with VS2019 2 lost communication with the server. Verify the machine is running and has a healthy network connection. For more information, see: https://go.microsoft.com/fwlink/?linkid=846610

Considering this was on a Hosted agent, the way it just stops the whole pipeline makes you feel a bit helpless. Shouldn't it somehow retry? As a developer, this issue distracts me from real work; it would be better if it were handled "behind the scenes".

EDIT: This is particularly annoying if you have many jobs in a build, since DevOps doesn't allow you to re-run a single failed job; instead you need to re-run the whole build.

@letmaik
Member Author

letmaik commented May 24, 2019

I observe this more frequently now, also on Windows, at random points in the job. This really makes Azure Pipelines unusable for me because you can only re-run the whole build and not individual jobs, and there's currently a good chance that one of them fails randomly.

@snap608

snap608 commented Jun 18, 2019

This now affects us too, and we are only using hosted agents. It is most noticeable on macOS.

@alepauly
Member

alepauly commented Jul 3, 2019

@letmaik, @snap608 Are you still seeing this issue? We'll look into it, sorry for the delay.

@alepauly
Member

alepauly commented Jul 3, 2019

> This also now affect us and we are only using hosted agents. Most noticeable on macOS.

@snap608 Can you send me an email to [email protected] with your Azure DevOps organization name so I can dig into it? Thanks!

@letmaik
Member Author

letmaik commented Jul 3, 2019

@alepauly Yes, see the job named "macOS CMake: Debug, serial" at https://dev.azure.com/wrf-cmake/WRF/_build/results?buildId=483. Please ignore any other jobs.

@alepauly
Member

alepauly commented Jul 3, 2019

> @alepauly Yes, see the job named "macOS CMake: Debug, serial" at https://dev.azure.com/wrf-cmake/WRF/_build/results?buildId=483. Please ignore any other jobs.

Thanks @letmaik, I actually found that earlier and engaged the Mac team to look into it. I'll let you know as soon as I have info. I'm still trying to figure out what might have been happening on the Windows side, but I didn't find a recent example of a job on a Windows machine failing for that reason. I might be able to dig up earlier failures, but if you see a recent one please let me know.

@letmaik
Member Author

letmaik commented Jul 16, 2019

@alepauly Windows seems fine, I haven't seen one recently. I still see a lot of macOS failures. Nearly all my builds fail because at least one of the macOS jobs loses connection. Could someone figure out a cause for this?

@alepauly
Member

@letmaik Thanks for confirming, I did engage the proper team to look into the MacOS disconnects. I'll follow up with them, apologies for the delay.

@arroyc

arroyc commented Jul 28, 2019

I'm using a self-hosted Linux pool and I see this often. Additionally, it would be nice if you supported predefined agent variables (e.g. agent.name, agent.jobstatus) with which we could place demands on whether the agent is enabled or disabled (like agent.status). Refer to #2367

@letmaik
Member Author

letmaik commented Aug 14, 2019

@alepauly I stumbled upon a Windows disconnect which looks a bit funny as the job appears finished: https://dev.azure.com/WRF-CMake/WRF/_build/results?buildId=543

@alepauly
Member

> @alepauly I stumbled upon a Windows disconnect which looks a bit funny as the job appears finished: https://dev.azure.com/WRF-CMake/WRF/_build/results?buildId=543

Thanks @letmaik. We'll investigate.

@pasangh

pasangh commented Aug 14, 2019

@letmaik are you using a self-hosted agent? It seems like a UI thing - it looks like it succeeded.

@letmaik
Member Author

letmaik commented Aug 15, 2019

@pasangh Microsoft-hosted.

@pasangh

pasangh commented Aug 15, 2019

@letmaik yeah, it seems like a network glitch. If possible, can you retry the build? I expect it to succeed.

@letmaik
Member Author

letmaik commented Aug 15, 2019

@pasangh Whether it was a glitch or not, Azure Pipelines should be resilient to that. It ended up in a confusing/impossible state, and that shouldn't happen. I would appreciate it if you dug deeper; it doesn't help me if this particular build goes green when retried. I want to limit the total time I spend on Azure Pipelines issues :)

@pasangh

pasangh commented Aug 15, 2019

@letmaik absolutely. I'll dig into this right away.

@pasangh

pasangh commented Aug 15, 2019

Adding @stephenmichaelf to look into this since this looks like an issue with the agent.

@stephenmichaelf following findings from my end:

  1. When I look at the builds here https://dev.azure.com/WRF-CMake/WRF/_build/results?buildId=543 and search for "Agent:", it says Azure Pipelines for all the builds except the one that failed. That makes me think the failed build ran on a self-hosted machine.

  2. The requests show success from the machine end so it seems like the agent connection was lost (because of some network glitch) right after the jobs succeeded but I'll let you investigate here.

@letmaik
Member Author

letmaik commented Aug 16, 2019

@pasangh It's not a self-hosted agent. This is an open source project and I'm using the free Microsoft-hosted agents.

@pasangh

pasangh commented Aug 19, 2019

@letmaik I have filed a bug against the UI team in charge of this experience to investigate the behavior in case of such network glitches and improve the scenario.

@bryanmacfarlane
Contributor

@zachariahcox - this may be a job abandon (which can happen for many reasons). Note this doesn't happen just because of network glitches. It can happen if a machine reboots, etc. What it really means is we haven't heard from the agent.

@letmaik
Member Author

letmaik commented Sep 4, 2019

@alepauly It's been two months. Any news on the macOS "lost communication" issues? I still receive them regularly.

@sdg002

sdg002 commented Oct 7, 2019

We are facing a similar problem. We have a large number of unit tests, and they take over an hour to run on a local desktop.

The agent: Hosted Agent lost communication with the server. Verify the machine is running and has a healthy network connection

[Screenshot: AzureVSTestLostComm]

@alepauly
Member

alepauly commented Nov 8, 2019

> @alepauly It's been two months. Any news on the macOS "lost communication" issues? I still receive them regularly.

@letmaik somehow I just saw this, apologies. I imagine you are still seeing them; the team in charge has been chasing those but has so far been unable to figure out what's going on.

@zachariahcox
Contributor

@letmaik, whoops! I did not notice that this was assigned to me.
Please send me a link to a recent build that shows this issue at my microsoft email address zacox at microsoft.com.

As mentioned above, many things can cause the agent to "lose connection", including reboots (not likely for hosted builds) or running the machine out of physical resources with the work payload.

Hosted agents are also only available for a limited period of time -- if your build exceeds this time limit we will send a cancel event. After that, if your work is still running, we will reclaim the machine.
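As an aside, the job-level time limit can be raised in the pipeline YAML. A sketch (the values here are examples, not recommendations; Microsoft-hosted pools enforce their own upper limits regardless of what you set):

```yaml
jobs:
- job: build
  # Cancel the job if it runs longer than this many minutes.
  timeoutInMinutes: 120
  # Extra time granted for cleanup after a cancel is requested.
  cancelTimeoutInMinutes: 5
```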

@letmaik
Member Author

letmaik commented Nov 9, 2019

@zachariahcox I have supplied 3 different lost-connection builds over the last 6 months in this issue. I'm not investing any more time into this. Note that time limits are not the issue.

@alepauly
Member

@letmaik is this still happening regularly? I know the Mac teams spent some time fixing issues that would lead to this type of error. Apologies for the lack and disjointedness of the responses here.

@ghost

ghost commented May 1, 2020

We are using Azure DevOps Server 2019 on-premises with agent version 2.153.1 and have the same problem: the agent machines are still connected, but jobs lose connection and fail.
I did network adapter testing using the Intel driver, and it reports the link as disconnected even though the machine still responds to pings on the network (running on Windows).

@zachariahcox can you elaborate on what kinds of workloads can kill the link during a job?

I would also like to know if there are any settings to control the connection properties, please.

@sopgenorth

The pipelines I manage for work have also been having this issue fairly regularly with a self-hosted Linux build agent.

It's especially frustrating since the logs don't seem to point to any obvious issues, and the Linux build agent remains responsive and not heavily loaded.

@CharmanderJieniJieni

We are having the same issue using an Azure Pipelines hosted agent with the Windows pool. It is a long-running job, and so far every run has failed at the 70+ minute mark with this error.

@HaGGi13

HaGGi13 commented Aug 5, 2020

We have the same issues at work. (on-premise systems)

Azure DevOps Server 2019 system:

  • Windows Server 2012 R2 v6.3 (Build 9600)
  • Azure DevOps Server 2019 Update 1 (17.153.29207.5)

Agent System:

  • Windows Server 2019 DC Edition v1809 (Build 17763.1339)
  • Agent version 2.172.0

After analyzing a lot of Azure DevOps Server log files and the Event Log without finding anything, we decided to remove the agent and install a fresh one. It worked perfectly, so we removed all the other agents that had this issue and re-installed them too. However, after approximately a week, one of the freshly installed agents had the same issue again.

The executed builds are relatively short; they take between 3 and 15 minutes. Sometimes a build on an agent fails, and a build run directly afterwards finishes successfully. I could not see any pattern or reason why it fails.
It fails on agents installed on the Azure DevOps Server itself, and it fails on agents that have their own server.
This issue is very sporadic.

Any advice on how to solve this issue, if it is a problem on our side?

@alepauly alepauly removed their assignment Aug 5, 2020
@zachariahcox
Contributor

howdy folks! this issue is kind of all over the place and the original issue was about hosted agents. I'll close this one.

To provide some context, the azure pipelines agent is composed of two processes: agent.listener and agent.worker (one of these per step in the job). The listener is responsible for reporting that workers are still making progress. If the agent.listener is unable to communicate with the server for 10 minutes (we attempt to communicate every minute), we assume something has Gone Wrong and abandon the job.
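(For anyone curious, the abandon rule described above — a communication attempt every minute, job abandoned after 10 silent minutes — boils down to a simple timeout check. This is an illustrative sketch with made-up names, not the actual agent code:)

```python
import time

HEARTBEAT_INTERVAL_SEC = 60        # the listener attempts to communicate every minute
ABANDON_THRESHOLD_SEC = 10 * 60    # the server abandons the job after 10 silent minutes

def should_abandon(last_heard_at: float, now: float) -> bool:
    """Server-side view of the rule: abandon the job once the full
    threshold has elapsed since the listener was last heard from."""
    return (now - last_heard_at) >= ABANDON_THRESHOLD_SEC

# Example: a listener silent for exactly 10 minutes gets abandoned,
# one that checked in 9 minutes ago does not.
```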

So, if you're running a private machine, anything that can interfere with the listener's ability to communicate with our server is going to be a problem.

Among the issues I've seen are:

  • anti-virus programs identifying the agent as a threat
  • local proxies acting up in various ways
  • the physical machine running out of memory or disk space (quite common)
  • the machine rebooting unexpectedly
  • someone ctrl+c'ing the whole listener process
  • the work payload being run at a much higher priority than the listener (thus "starving" the listener out)
  • unit tests shutting down network adapters (quite common)
  • having too many agents at normal priority on the same machine so they starve each other out (@HaGGi13, kind of a shot in the dark, but this might be the case for you?)

etc.

If you think you're seeing an issue that cannot be explained by any of the above (and nothing jumps out at you from the _diag logs folder), please file an issue at https://azure.microsoft.com/en-us/support/devops/

Thanks!

@nahmed23

@zachariahcox we are also having the same issue in our Azure-hosted pipeline, with the error "We stopped hearing from agent Hosted Agent. Verify the agent machine is running and has a healthy network connection. Anything that terminates an agent process, starves it for CPU, or blocks its network access can cause this error."

@fahrudina

@zachariahcox we are also having the same issue on an Azure-hosted agent: "[error]We stopped hearing from agent Hosted Agent. Verify the agent machine is running and has a healthy network connection. Anything that terminates an agent process, starves it for CPU, or blocks its network access can cause this error." Is it possible to upgrade the agent's hardware specification? The current agent specification:

Hardware Overview:
Model Name: Apple device
Model Identifier: VMware7,1
Processor Speed: 3.33 GHz
Number of Processors: 2
Total Number of Cores: 4
L2 Cache (per Core): 256 KB
L3 Cache (per Processor): 12 MB
Memory: 12 GB
Boot ROM Version: VMW71.00V.13989454.B64.1906190538
Apple ROM Info: [MS_VM_CERT/SHA1/27d66596a61c48dd3dc7216fd715126e33f59ae7]Welcome to the Virtual Machine
SMC Version (system): 2.8f0
Serial Number (system): VMxg9WoAmafW
Hardware UUID: 4203018E-580F-C1B5-9525-B745CECA79EB

@brandondev1

Constantly seeing this issue when deploying D365 F&O builds. It happens more often than not. The deployment actually succeeds, but the pipeline fails due to this issue. We are using the free MS-hosted agent.

@KrylixZA

I have recently moved the self-hosted build agents that I manage for my organization from running as Windows Services directly on the machine into Docker containers running on a Windows base image mcr.microsoft.com/dotnet/framework/runtime:4.8.

The Dockerfile is relatively straightforward:

  1. Install VS 2019 Build Tools
  2. Install VS 2019 Test Agent tools
  3. Install NodeJS, Git, Java JDK and various other things that may be needed for our specific build cases.
  4. Copy and run the start.ps1 file defined in the Run a self-hosted agent in Docker guide for Windows.

I am using docker-compose up --detach with 2 replicas and a restart policy of any in my Docker Compose file. 99% of the time the agents respond okay and complete the builds within around 1-5 minutes, depending on the project being built and whether it's a newly spun-up container. However, this issue is still present, and it occurs more during the nightly build timeframe than during the day, though it does happen during the day as well. This isn't related to overloaded systems, as there is plenty of headroom even when running builds on both containers.

What's unfortunate is that I am unable to view any build logs from the agent's work directory as everything is disposed when the agent crashes and is restarted automatically. I could look at mounting volumes but the idea of moving the agents into Docker was to get away from sharing resources on the physical server hosting them in the first place (related to this issue with the SonarQube steps here - https://community.sonarsource.com/t/make-azure-devops-build-tasks-atomic-to-the-agent/37250).

When I ran the standard Windows Service on the servers and only per server, everything was dandy, but slow. The Docker containers are much faster and more lightweight and isolated which solves the initial problem when I set out on this journey. I would love to end it up with a stable set of Docker containers in the end.

Host details:

  • Windows Server 2019 Standard
  • 6 cores, 12 threads
  • 12GB of RAM
  • 150GB drive space with about 50GB free at any given time
  • 10GB network to each host
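(Regarding the disposed logs mentioned above: one way to keep the agent's diagnostic logs across container restarts is to mount only the _diag folder as a volume, rather than sharing the whole work directory. A sketch; the image name is hypothetical and the C:\azp path is an assumption based on the standard layout from the "Run a self-hosted agent in Docker" guide, so adjust to your Dockerfile:)

```yaml
# docker-compose.yml fragment (sketch, not a verified config)
services:
  agent:
    image: my-windows-agent:latest   # hypothetical image name
    volumes:
      # Persist only the diagnostic logs so a crashed agent's _diag
      # output survives the container being restarted and replaced.
      - ./agent-diag:C:\azp\agent\_diag
```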

@petermisovic

Hello, I just want to inform you that the lost-connection issue is not just related to Azure DevOps in the cloud. It is happening as well on on-premises Azure DevOps Server and on the TFS instance we still run for our project. Our agents lose connection regularly, and it really breaks the CI/CD concept.

@anatolybolshakov
Contributor

Hi @petermisovic, please feel free to open a new issue with more details if you still face a similar problem.
