Hosted agent "lost communication with the server" #2261
Comments
I observe this more frequently now, also on Windows, at random points in the job. This really makes Azure Pipelines unusable for me, because you can only re-run the whole build and not individual jobs, and there's currently a good chance that one of them fails randomly.
This now affects us too, and we are only using hosted agents. It is most noticeable on macOS.
@snap608 Can you send me an email to [email protected] with your Azure DevOps organization name so I can dig into it? Thanks!
@alepauly Yes, see the job named "macOS CMake: Debug, serial" at https://dev.azure.com/wrf-cmake/WRF/_build/results?buildId=483. Please ignore any other jobs.
Thanks @letmaik, I actually found that earlier and engaged the Mac team to look into it. I'll let you know as soon as I have info. I'm still trying to figure out what might have been happening on the Windows side, but I didn't find a recent example of a job on a Windows machine failing for that reason. I might be able to dig up earlier failures, but if you see a recent one please let me know.
@alepauly Windows seems fine, I haven't seen one recently. I still see a lot of macOS failures; nearly all my builds fail because at least one of the macOS jobs loses connection. Could someone figure out a cause for this?
@letmaik Thanks for confirming, I did engage the proper team to look into the macOS disconnects. I'll follow up with them; apologies for the delay.
I'm using a self-hosted Linux pool and I see this often. Additionally, it would be nice if you supported predefined agent variables (e.g. agent.name, agent.jobstatus) that we could use to place demands based on whether the agent is enabled or disabled (like agent.status). Refer to #2367
@alepauly I stumbled upon a Windows disconnect which looks a bit funny, as the job appears finished: https://dev.azure.com/WRF-CMake/WRF/_build/results?buildId=543
Thanks @letmaik. We'll investigate.
@letmaik are you using a self-hosted agent? It seems like a UI thing - it looks like it succeeded.
@pasangh Microsoft-hosted.
@letmaik yeah, it seems like a network glitch. If possible, can you retry the build? I expect it to succeed.
@pasangh Whether it was a glitch or not, Azure Pipelines should be resilient to that. It ended up in a confusing/impossible state, and that shouldn't happen. I would appreciate it if you dug deeper; it doesn't help me if this particular build goes green when retrying. I want to limit the total time I spend on Azure Pipelines issues :)
@letmaik absolutely. I'll dig into this right away.
Adding @stephenmichaelf to look into this since this looks like an issue with the agent. @stephenmichaelf, here are the findings from my end:
@pasangh It's not a self-hosted agent. This is an open-source project and I'm using the free Microsoft-hosted agents.
@letmaik I have filed a bug against the UI team in charge of this experience to investigate the behavior in the case of such network glitches and improve the scenario.
@zachariahcox - this may be a job abandon (which can happen for many reasons). Note this doesn't just happen because of network glitches; it can also happen if a machine reboots, etc. What it really means is that we haven't heard from the agent.
@alepauly It's been two months. Any news on the macOS "lost communication" issues? I still receive them regularly.
@letmaik, whoops! I did not notice that this was assigned to me. As mentioned above, many things can cause the agent to "lose connection", including reboots (not likely for hosted builds) or running the machine out of physical resources with the work payload. Hosted agents are also only available for a limited period of time -- if your build exceeds this time limit, we will send a cancel event. After that, if your work is still running, we will reclaim the machine.
@zachariahcox I have supplied 3 different lost-connection builds over the last 6 months in this issue. I'm not investing any more time into this. Note that time limits are not the issue.
@letmaik is this still happening regularly? I know the Mac team spent some time fixing issues that would lead to this type of error. Apologies for the slow and disjointed responses here.
Using Azure DevOps Server 2019 on-premises with agent version 2.153.1, we are having the same problem: the agent machines are still connected, but jobs lose connection and fail. @zachariahcox can you elaborate on what kinds of workloads can kill the link during a job? I would also like to know whether there are any settings to control the connection properties, please.
The pipelines I manage for work have also been having this issue fairly regularly with a self-hosted Linux build agent. It's especially frustrating since the logs don't seem to point to any obvious issues, and the Linux build agent remains responsive and not heavily loaded.
We are having the same issue using an Azure Pipelines hosted agent with the Windows pool. It is a long-running job, and so far all runs have failed at 70+.
We have the same issues at work (on-premises systems). Azure DevOps Server 2019 system:
Agent system:
After analyzing a lot of Azure DevOps Server log files and the Event Log without finding anything, we decided to remove the agent and install a fresh one. That worked perfectly, so we removed all the other agents which had this issue and re-installed them too. Well, after approximately a week, one of the new, freshly installed agents had the same issue again. The executed builds are relatively short; they need between 3 and 15 minutes. Sometimes a build on the agent fails, and a build run directly afterwards finishes successfully. I could not see any pattern or reason why it fails. Any advice on how to solve this issue, if it is a problem on our side?
howdy folks! This issue is kind of all over the place, and the original issue was about hosted agents, so I'll close this one.

To provide some context, the Azure Pipelines agent is composed of two processes: agent.listener and agent.worker (one of these per running job). So, if you're running a private machine, anything that can interfere with the listener's ability to communicate with our server is going to be a problem. Among the issues I've seen are: anti-virus programs identifying it as a threat; local proxies acting up in various ways; the physical machine running out of memory or disk space (quite common); the machine rebooting unexpectedly; someone ctrl+c'ing the whole listener process; the work payload being run at a much higher priority than the listener (thus "starving" the listener out); unit tests shutting down network adapters (quite common); having too many agents at normal priority on the same machine so that they starve each other out (@HaGGi13, kind of a shot in the dark, but this might be the case for you?); etc.

If you think you're seeing an issue that cannot be explained by any of the above (and nothing jumps out at you from the agent logs), please open a new issue with the details. Thanks!
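To make the failure mode described in that comment concrete, here is a small, hypothetical diagnostic sketch (not an official tool, and not from the thread): it checks the two things the error boils down to, whether the Agent.Listener process is still alive and whether the machine can still reach the Azure DevOps service over HTTPS. The host name, port, and Windows-only process check are assumptions you would adapt to your setup.

```python
# Hypothetical diagnostic sketch, not part of the agent: checks whether the
# Agent.Listener process is running and whether the Azure DevOps service is
# reachable, since losing either is what produces the "lost communication"
# error described above. Host name and the process check are assumptions.
import socket
import subprocess

SERVICE_HOST = "dev.azure.com"  # assumption: adjust for your organization or on-premises server

def listener_running() -> bool:
    """Windows-only check via tasklist; on Linux/macOS you would look for the
    Agent.Listener process with ps instead."""
    result = subprocess.run(
        ["tasklist", "/FI", "IMAGENAME eq Agent.Listener.exe"],
        capture_output=True, text=True, check=False,
    )
    return "Agent.Listener.exe" in result.stdout

def server_reachable(host: str = SERVICE_HOST, port: int = 443, timeout: float = 5.0) -> bool:
    """True if a TCP connection to the service endpoint can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    print(f"Agent.Listener running: {listener_running()}")
    print(f"{SERVICE_HOST} reachable on 443: {server_reachable()}")
```

If the listener process is gone or the endpoint is unreachable around the time of a failure, the causes listed in the comment above (anti-virus, proxies, resource exhaustion, reboots, adapter shutdowns) are the usual suspects.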
@zachariahcox we are also having the same issue in our Azure-hosted pipeline, with the error "We stopped hearing from agent Hosted Agent. Verify the agent machine is running and has a healthy network connection. Anything that terminates an agent process, starves it for CPU, or blocks its network access can cause this error. For more information ..."
@zachariahcox we are also having the same issue on an Azure-hosted agent.
Constantly seeing this issue when deploying D365 F&O builds; it happens more often than not. The deployment actually succeeds, but the pipeline fails due to this issue. We are using the free MS-hosted agent.
I have recently moved the self-hosted build agents that I manage for my organization from running as Windows Services directly on the machine into Docker containers running on the Windows base image mcr.microsoft.com/dotnet/framework/runtime:4.8. The Dockerfile is relatively straightforward.
I am using ... What's unfortunate is that I am unable to view any build logs from the agent's work directory, as everything is disposed of when the agent crashes and is restarted automatically (see the sketch after this comment for one possible workaround). I could look at mounting volumes, but the idea of moving the agents into Docker was to get away from sharing resources on the physical server hosting them in the first place (related to this issue with the SonarQube steps: https://community.sonarsource.com/t/make-azure-devops-build-tasks-atomic-to-the-agent/37250). When I ran the standard Windows Service on the servers, one per server, everything was dandy, but slow. The Docker containers are much faster, more lightweight, and isolated, which solves the problem I initially set out to fix on this journey. I would love to end up with a stable set of Docker containers. Host details:
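Since the in-container logs are lost on restart, here is a minimal, hypothetical helper sketch (not from the thread): it periodically copies the agent's diagnostic logs from the _diag folder to a persistent location such as a mounted volume, so they survive a container crash and restart. Both paths are assumptions and depend on where the agent is installed.

```python
# Hypothetical helper, not part of the agent: periodically copies the agent's
# diagnostic logs (the _diag folder under the agent root) to a persistent
# location so they survive a container restart. Both paths are assumptions.
import shutil
import time
from pathlib import Path

AGENT_DIAG = Path(r"C:\azp\agent\_diag")      # assumed agent install path inside the container
PERSISTENT_DIR = Path(r"C:\persisted-logs")   # assumed mounted/persistent path

def sync_logs() -> None:
    """Copy any new or updated log files from the agent's _diag folder."""
    PERSISTENT_DIR.mkdir(parents=True, exist_ok=True)
    for log_file in AGENT_DIAG.glob("*.log"):
        target = PERSISTENT_DIR / log_file.name
        if not target.exists() or log_file.stat().st_size != target.stat().st_size:
            shutil.copy2(log_file, target)

if __name__ == "__main__":
    while True:
        sync_logs()
        time.sleep(30)  # sync every 30 seconds
```

Running something like this as a background process is only a stopgap; it just preserves the listener and worker logs long enough to see what happened right before the container went down.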
Hello, I just want to inform you that the connection-lost issue is not just related to Azure DevOps in the cloud. It is happening as well on our on-premises Azure DevOps Server and on the TFS instance we still run for our project. Our agents are losing connection regularly, and it really breaks the CI/CD concept.
Hi @petermisovic, please feel free to open a new issue with more details if you still face a similar problem.
Agent Version and Platform
Version of your agent? '2.150.3' according to the "Initialize job" task
OS of the machine running the agent? OSX, Windows
Azure DevOps Type and Version
dev.azure.com
What's not working?
https://dev.azure.com/WRF-CMake/WRF/_build/results?buildId=324
https://dev.azure.com/WRF-CMake/WRF/_build/results?buildId=373
https://dev.azure.com/WRF-CMake/WRF/_build/results?buildId=374
A job stopped with a "lost communication with the server" error.
Considering this was on a Hosted agent, the way it just stops the whole pipeline makes you feel a bit helpless. Shouldn't it somehow retry? As a developer, this issue distracts me from real work, and it would be better if it were handled "behind the scenes".
EDIT: This is particularly annoying if you have many jobs in a build since DevOps doesn't allow you to re-run a single failed job, instead you need to re-run the whole build.