
Hosted agent "lost communication with the server" #2261

Closed
letmaik opened this issue May 19, 2019 · 37 comments
@letmaik
Member

letmaik commented May 19, 2019

Agent Version and Platform

Version of your agent? '2.150.3' according to the "Initialize job" task

OS of the machine running the agent? OSX, Windows

Azure DevOps Type and Version

dev.azure.com

What's not working?

https://dev.azure.com/WRF-CMake/WRF/_build/results?buildId=324
https://dev.azure.com/WRF-CMake/WRF/_build/results?buildId=373
https://dev.azure.com/WRF-CMake/WRF/_build/results?buildId=374

A job stopped with:

##[Error 1]
The agent: Hosted macOS High Sierra 5 lost communication with the server. Verify the machine is running and has a healthy network connection. For more information, see: https://go.microsoft.com/fwlink/?linkid=846610

##[Error 1]
The agent: Hosted Windows 2019 with VS2019 2 lost communication with the server. Verify the machine is running and has a healthy network connection. For more information, see: https://go.microsoft.com/fwlink/?linkid=846610

Considering this was on a Hosted agent, the way it just stops the whole pipeline makes you feel a bit helpless. Shouldn't it somehow retry? As a developer, this issue distracts me from real work; it would be better if it were handled "behind the scenes".

EDIT: This is particularly annoying if you have many jobs in a build, since DevOps doesn't allow you to re-run a single failed job; instead you need to re-run the whole build.

@letmaik
Member Author

letmaik commented May 24, 2019

I observe this more frequently now, also on Windows, at random points in the job. This really makes Azure Pipelines unusable for me because you can only re-run the whole build and not individual jobs, and there's currently a good chance that one of them fails randomly.

@snap608

snap608 commented Jun 18, 2019

This now affects us too, and we are only using hosted agents. It is most noticeable on macOS.

@alepauly
Member

alepauly commented Jul 3, 2019

@letmaik, @snap608 Are you still seeing this issue? We'll look into it, sorry for the delay.

@alepauly
Member

alepauly commented Jul 3, 2019

> This also now affect us and we are only using hosted agents. Most noticeable on macOS.

@snap608 Can you send me an email to [email protected] with your Azure DevOps organization name so I can dig into it? Thanks!

@letmaik
Member Author

letmaik commented Jul 3, 2019

@alepauly Yes, see the job named "macOS CMake: Debug, serial" at https://dev.azure.com/wrf-cmake/WRF/_build/results?buildId=483. Please ignore any other jobs.

@alepauly
Member

alepauly commented Jul 3, 2019

> @alepauly Yes, see the job named "macOS CMake: Debug, serial" at https://dev.azure.com/wrf-cmake/WRF/_build/results?buildId=483. Please ignore any other jobs.

Thanks @letmaik, I actually found that earlier and engaged the Mac team to look into it. I'll let you know as soon as I have info. I'm still trying to figure out what might have been happening on the Windows side, but I didn't find a recent example of a job on a Windows machine failing for that reason. I might be able to dig up earlier failures, but if you see a recent one please let me know.

@letmaik
Member Author

letmaik commented Jul 16, 2019

@alepauly Windows seems fine, I haven't seen one recently. I still see a lot of macOS failures. Nearly all my builds fail because at least one of the macOS jobs loses connection. Could someone figure out a cause for this?

@alepauly
Member

@letmaik Thanks for confirming, I did engage the proper team to look into the MacOS disconnects. I'll follow up with them, apologies for the delay.

@arroyc

arroyc commented Jul 28, 2019

I'm using a self-hosted Linux pool and I see this often. Additionally, it would be nice if you supported predefined agent variables (e.g. agent.name, agent.jobstatus) with which we could place demands on whether the agent is enabled or disabled (like agent.status). Refer to #2367

@letmaik
Member Author

letmaik commented Aug 14, 2019

@alepauly I stumbled upon a Windows disconnect which looks a bit funny as the job appears finished: https://dev.azure.com/WRF-CMake/WRF/_build/results?buildId=543

@alepauly
Member

> @alepauly I stumbled upon a Windows disconnect which looks a bit funny as the job appears finished: https://dev.azure.com/WRF-CMake/WRF/_build/results?buildId=543

Thanks @letmaik. We'll investigate.

@pasangh

pasangh commented Aug 14, 2019

@letmaik are you using a self-hosted agent? It seems like a UI thing - it looks like it succeeded.

@letmaik
Member Author

letmaik commented Aug 15, 2019

@pasangh Microsoft-hosted.

@pasangh

pasangh commented Aug 15, 2019

@letmaik yeah, it seems like a network glitch. If possible, can you retry the build? I expect it to succeed.

@letmaik
Member Author

letmaik commented Aug 15, 2019

@pasangh Whether it was a glitch or not, Azure Pipelines should be resilient to that. It ended up in a confusing/impossible state, and that shouldn't happen. I would appreciate it if you dug deeper; it doesn't help me if this particular build goes green when retried. I want to limit the total time I spend on Azure Pipelines issues :)

@pasangh

pasangh commented Aug 15, 2019

@letmaik absolutely. I'll dig into this right away.

@pasangh

pasangh commented Aug 15, 2019

Adding @stephenmichaelf to look into this since this looks like an issue with the agent.

@stephenmichaelf following findings from my end:

  1. When I look at the builds here https://dev.azure.com/WRF-CMake/WRF/_build/results?buildId=543 and search for "Agent:", it says Azure Pipelines for all the builds except the one that failed. That makes me think the failed build ran on a self-hosted machine.

  2. The requests show success from the machine end so it seems like the agent connection was lost (because of some network glitch) right after the jobs succeeded but I'll let you investigate here.

@letmaik
Member Author

letmaik commented Aug 16, 2019

@pasangh It's not a self-hosted agent. This is an open source project and I'm using the free Microsoft-hosted agents.

@pasangh

pasangh commented Aug 19, 2019

@letmaik I have filed a bug against the UI team in charge of this experience to investigate the behavior in case of such network glitches and improve the scenario.

@bryanmacfarlane
Contributor

@zachariahcox - this may be a job abandon (which can happen for many reasons). Note this doesn't happen just because of network glitches. It can happen if a machine reboots, etc. What it really means is we haven't heard from the agent.

@letmaik
Member Author

letmaik commented Sep 4, 2019

@alepauly It's been two months. Any news on the macOS "lost communication" issues? I still receive them regularly.

@sdg002

sdg002 commented Oct 7, 2019

We are facing a similar problem. We have a large number of unit tests, and they take over an hour to run on a local desktop.

The agent: Hosted Agent lost communication with the server. Verify the machine is running and has a healthy network connection

[Screenshot: AzureVSTestLostComm]

@alepauly
Member

alepauly commented Nov 8, 2019

> @alepauly It's been two months. Any news on the macOS "lost communication" issues? I still receive them regularly.

@letmaik somehow I just saw this, apologies. I imagine you are still seeing them; the team in charge has been chasing those but has so far been unable to figure out what's going on.

@zachariahcox
Contributor

@letmaik, whoops! I did not notice that this was assigned to me.
Please send me a link to a recent build that shows this issue at my microsoft email address zacox at microsoft.com.

As mentioned above, many things can cause the agent to "lose connection", including reboots (not likely for hosted builds) or running the machine out of physical resources with the work payload.

Hosted agents are also only available for a limited period of time -- if your build exceeds this time limit we will send a cancel event. After that, if your work is still running, we will reclaim the machine.
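As an aside, the job-level time limit can be raised in the pipeline YAML. A sketch (the values here are examples, not recommendations; Microsoft-hosted pools enforce their own upper limits regardless of what you set):

```yaml
jobs:
- job: build
  # Cancel the job if it runs longer than this many minutes.
  timeoutInMinutes: 120
  # Extra time granted for cleanup after a cancel is requested.
  cancelTimeoutInMinutes: 5
```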

@letmaik
Member Author

letmaik commented Nov 9, 2019

@zachariahcox I have supplied 3 different lost-connection builds over the last 6 months in this issue. I'm not investing any more time into this. Note that time limits are not the issue.

@alepauly
Member

@letmaik is this still happening regularly? I know the Mac teams spent some time fixing issues that would lead to this type of error. Apologies for the lack and disjointedness of the responses here.

@ghost

ghost commented May 1, 2020

We are using Azure DevOps Server 2019 on-premises with agent version 2.153.1 and have the same problem: the agent machines are still connected, but jobs lose connection and fail.
I did network adapter testing using the Intel driver, and it reports the link as disconnected even though the machine still responds to pings on the network (running on Windows).

@zachariahcox can you elaborate on what kinds of workloads can kill the link during a job?

I would also like to know if there are any settings to control the connection properties, please.

@sopgenorth

The pipelines I manage for work have also been having this issue fairly regularly with a self-hosted Linux build agent.

It's especially frustrating since the logs don't seem to point to any obvious issues, and the Linux build agent remains responsive and not heavily loaded.

@CharmanderJieniJieni

We are having the same issue using an Azure Pipelines hosted agent with the Windows pool. It is a long-running job, and so far every run has failed at the 70+ minute mark with this error.

@HaGGi13

HaGGi13 commented Aug 5, 2020

We have the same issues at work. (on-premise systems)

Azure DevOps Server 2019 system:

  • Windows Server 2012 R2 v6.3 (Build 9600)
  • Azure DevOps Server 2019 Update 1 (17.153.29207.5)

Agent System:

  • Windows Server 2019 DC Edition v1809 (Build 17763.1339)
  • Agent version 2.172.0

After analyzing a lot of Azure DevOps Server log files and the Event Log without finding anything, we decided to remove the agent and install a fresh one. It worked perfectly, so we removed all the other agents that had this issue and re-installed them too. However, after approximately a week, one of the freshly installed agents had the same issue again.

The executed builds are relatively short; they take between 3 and 15 minutes. Sometimes a build on an agent fails, and a build run directly afterwards finishes successfully. I could not see any pattern or reason why it fails.
It fails on agents installed on the Azure DevOps Server itself, and it fails on agents that have their own server.
This issue is very sporadic.

Any advice on how to solve this issue, if it is a problem on our side?

@alepauly alepauly removed their assignment Aug 5, 2020
@zachariahcox
Contributor

howdy folks! this issue is kind of all over the place and the original issue was about hosted agents. I'll close this one.

To provide some context, the azure pipelines agent is composed of two processes: agent.listener and agent.worker (one of these per step in the job). The listener is responsible for reporting that workers are still making progress. If the agent.listener is unable to communicate with the server for 10 minutes (we attempt to communicate every minute), we assume something has Gone Wrong and abandon the job.
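(For anyone curious, the abandon rule described above — a communication attempt every minute, job abandoned after 10 silent minutes — boils down to a simple timeout check. This is an illustrative sketch with made-up names, not the actual agent code:)

```python
import time

HEARTBEAT_INTERVAL_SEC = 60        # the listener attempts to communicate every minute
ABANDON_THRESHOLD_SEC = 10 * 60    # the server abandons the job after 10 silent minutes

def should_abandon(last_heard_at: float, now: float) -> bool:
    """Server-side view of the rule: abandon the job once the full
    threshold has elapsed since the listener was last heard from."""
    return (now - last_heard_at) >= ABANDON_THRESHOLD_SEC

# Example: a listener silent for exactly 10 minutes gets abandoned,
# one that checked in 9 minutes ago does not.
```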

So, if you're running a private machine, anything that can interfere with the listener's ability to communicate with our server is going to be a problem.

Among the issues I've seen are:

  • anti-virus programs identifying the agent as a threat
  • local proxies acting up in various ways
  • the physical machine running out of memory or disk space (quite common)
  • the machine rebooting unexpectedly
  • someone ctrl+c'ing the whole listener process
  • the work payload being run at a much higher priority than the listener (thus "starving" the listener out)
  • unit tests shutting down network adapters (quite common)
  • having too many agents at normal priority on the same machine so they starve each other out (@HaGGi13, kind of a shot in the dark, but this might be the case for you?)

etc.

If you think you're seeing an issue that cannot be explained by any of the above (and nothing jumps out at you from the _diag logs folder), please file an issue at https://azure.microsoft.com/en-us/support/devops/

Thanks!

@nahmed23

@zachariahcox we are also having the same issue in our Azure-hosted pipeline, with the error "We stopped hearing from agent Hosted Agent. Verify the agent machine is running and has a healthy network connection. Anything that terminates an agent process, starves it for CPU, or blocks its network access can cause this error."

@fahrudina

@zachariahcox we are also having the same issue on an Azure-hosted agent: "[error]We stopped hearing from agent Hosted Agent. Verify the agent machine is running and has a healthy network connection. Anything that terminates an agent process, starves it for CPU, or blocks its network access can cause this error." Is it possible to upgrade the agent's hardware specification? The current agent specification:

Hardware Overview:
Model Name: Apple device
Model Identifier: VMware7,1
Processor Speed: 3.33 GHz
Number of Processors: 2
Total Number of Cores: 4
L2 Cache (per Core): 256 KB
L3 Cache (per Processor): 12 MB
Memory: 12 GB
Boot ROM Version: VMW71.00V.13989454.B64.1906190538
Apple ROM Info: [MS_VM_CERT/SHA1/27d66596a61c48dd3dc7216fd715126e33f59ae7]Welcome to the Virtual Machine
SMC Version (system): 2.8f0
Serial Number (system): VMxg9WoAmafW
Hardware UUID: 4203018E-580F-C1B5-9525-B745CECA79EB

@brandondev1

Constantly seeing this issue when deploying D365 F&O builds. It happens more often than not. The deployment actually succeeds, but the pipeline fails due to this issue. We are using the free MS-hosted agent.

@KrylixZA

I have recently moved the self-hosted build agents that I manage for my organization from running as Windows Services directly on the machine into Docker containers running on a Windows base image mcr.microsoft.com/dotnet/framework/runtime:4.8.

The Dockerfile is relatively straightforward:

  1. Install VS 2019 Build Tools
  2. Install VS 2019 Test Agent tools
  3. Install NodeJS, Git, Java JDK and various other things that may be needed for our specific build cases.
  4. Copy and run the start.ps1 file defined in the Run a self-hosted agent in Docker guide for Windows.

I am using docker-compose up --detach with 2 replicas and a restart policy of any in my Docker Compose file. 99% of the time the agents respond okay and complete the builds within around 1-5 minutes, depending on the project being built and whether it's a newly spun-up container. However, this issue is still present, and it occurs more during the nightly build timeframe than during the day, though it does happen during the day as well. This isn't related to overloaded systems, as there is plenty of headroom even when running builds on both containers.

What's unfortunate is that I am unable to view any build logs from the agent's work directory as everything is disposed when the agent crashes and is restarted automatically. I could look at mounting volumes but the idea of moving the agents into Docker was to get away from sharing resources on the physical server hosting them in the first place (related to this issue with the SonarQube steps here - https://community.sonarsource.com/t/make-azure-devops-build-tasks-atomic-to-the-agent/37250).

When I ran the standard Windows Service on the servers and only per server, everything was dandy, but slow. The Docker containers are much faster and more lightweight and isolated which solves the initial problem when I set out on this journey. I would love to end it up with a stable set of Docker containers in the end.

Host details:

  • Windows Server 2019 Standard
  • 6 cores, 12 threads
  • 12GB of RAM
  • 150GB drive space with about 50GB free at any given time
  • 10GB network to each host
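(Regarding the disposed logs mentioned above: one way to keep the agent's diagnostic logs across container restarts is to mount only the _diag folder as a volume, rather than sharing the whole work directory. A sketch; the image name is hypothetical and the C:\azp path is an assumption based on the standard layout from the "Run a self-hosted agent in Docker" guide, so adjust to your Dockerfile:)

```yaml
# docker-compose.yml fragment (sketch, not a verified config)
services:
  agent:
    image: my-windows-agent:latest   # hypothetical image name
    volumes:
      # Persist only the diagnostic logs so a crashed agent's _diag
      # output survives the container being restarted and replaced.
      - ./agent-diag:C:\azp\agent\_diag
```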

@petermisovic

Hello, I just want to inform you that the lost-connection issue is not just related to Azure DevOps in the cloud. It is happening as well on on-premises Azure DevOps Server and on the TFS instance we still run for our project. Our agents lose connection regularly, and it really breaks the CI/CD concept.

@anatolybolshakov
Contributor

Hi @petermisovic, please feel free to open a new issue with more details if you still face a similar problem.
