Agent instance fails to connect to master despite port being open #669

Open
pkaramol opened this issue Jan 28, 2020 · 28 comments

Comments

@pkaramol

pkaramol commented Jan 28, 2020

I am installing Jenkins on GKE using the official Helm chart.

I have tried jnlp images with both the 3.27-1 and 3.40-1 tags.

When starting a simple (shell execution) job, the agent pod starts running but is then terminated with an error.
Its error logs are the following:

jenkins-agent-5j324 jnlp java.io.IOException: Failed to connect to http://jenkins-inception.jenkins.svc.cluster.local:8080/jenkins/tcpSlaveAgentListener/: Connection refused (Connection refused)
jenkins-agent-5j324 jnlp 	at org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver.resolve(JnlpAgentEndpointResolver.java:196)
jenkins-agent-5j324 jnlp 	at hudson.remoting.Engine.innerRun(Engine.java:523)
jenkins-agent-5j324 jnlp 	at hudson.remoting.Engine.run(Engine.java:474)
jenkins-agent-5j324 jnlp Caused by: java.net.ConnectException: Connection refused (Connection refused)
jenkins-agent-5j324 jnlp 	at java.net.PlainSocketImpl.socketConnect(Native Method)
jenkins-agent-5j324 jnlp 	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
jenkins-agent-5j324 jnlp 	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
jenkins-agent-5j324 jnlp 	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
jenkins-agent-5j324 jnlp 	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
jenkins-agent-5j324 jnlp 	at java.net.Socket.connect(Socket.java:589)
jenkins-agent-5j324 jnlp 	at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
jenkins-agent-5j324 jnlp 	at sun.net.www.http.HttpClient.openServer(HttpClient.java:463)
jenkins-agent-5j324 jnlp 	at sun.net.www.http.HttpClient.openServer(HttpClient.java:558)
jenkins-agent-5j324 jnlp 	at sun.net.www.http.HttpClient.<init>(HttpClient.java:242)
jenkins-agent-5j324 jnlp 	at sun.net.www.http.HttpClient.New(HttpClient.java:339)
jenkins-agent-5j324 jnlp 	at sun.net.www.http.HttpClient.New(HttpClient.java:357)
jenkins-agent-5j324 jnlp 	at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1220)
jenkins-agent-5j324 jnlp 	at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1156)
jenkins-agent-5j324 jnlp 	at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1050)
jenkins-agent-5j324 jnlp 	at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:984)
jenkins-agent-5j324 jnlp 	at org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver.resolve(JnlpAgentEndpointResolver.java:193)
jenkins-agent-5j324 jnlp 	... 2 more
jenkins-agent-5j324 jnlp

I have created a test pod within the same master/agent namespace and no connectivity issue seems to exist:

/ # dig +short jenkins-inception.jenkins.svc.cluster.local
10.14.203.189
/ # nc -zv -w 3 jenkins-inception.jenkins.svc.cluster.local 8080
jenkins-inception.jenkins.svc.cluster.local (10.14.203.189:8080) open
/ # curl http://jenkins-inception.jenkins.svc.cluster.local:8080/jenkins/tcpSlaveAgentListener/


  Jenkins

Environment:

  • cloud provider: GCP
  • master tag: lts
  • agent tag: 3.27-1 and 3.40-1
  • helm version:
Client: &version.Version{SemVer:"v2.9.1", GitCommit:"20adb27c7c5868466912eebdf6664e7390ebe710", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.11.0", GitCommit:"2e55dbe1fdb5fdb96b75ff144a339489417b146b", GitTreeState:"clean"}
  • kubernetes version:
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.3", GitCommit:"5e53fd6bc17c0dec8434817e69b04a25d8ae0ff0", GitTreeState:"clean", BuildDate:"2019-06-06T01:44:30Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.11-gke.14", GitCommit:"56d89863d1033f9668ddd6e1c1aea81cd846ef88", GitTreeState:"clean", BuildDate:"2019-11-07T19:12:22Z", GoVersion:"go1.12.11b4", Compiler:"gc", Platform:"linux/amd64"}
  • istio version: 1.4.0
@timmyers

timmyers commented Feb 28, 2020

I believe this is happening because the envoy proxy is taking some time to set things up and the jnlp container tries to make a connection while this is still happening. I have had similar issues with recent versions of istio. Unfortunately I don't have a fix yet.

One solution would be for jnlp-slave to retry this connection instead of giving up on the first failure.

@pkaramol
Author

I can also confirm that this occurs on a GKE cluster using istio 1.4.0 but NOT on another one using an older version of istio, e.g. 1.1.15

@aspring

aspring commented Apr 12, 2020

Following up on @timmyers's comment: this is exactly what I was observing. I built a custom jnlp image that uses wait-for-it to make sure the pod can connect to Jenkins before launching jenkins-agent. This solved the connectivity issue; in my testing, it takes about 3s on our cluster for the connection to become available.
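
A minimal sketch of the wait-then-launch idea, for readers who want something to start from. This is not the custom image described above: it polls with `curl` rather than wait-for-it, the function names `probe`, `wait_for_listener`, and `main` are made up for illustration, and it assumes `JENKINS_URL` is set in the container and that the stock entrypoint lives at `/usr/local/bin/jenkins-agent`.

```shell
#!/bin/sh
# Sketch of a wait-then-launch wrapper (hypothetical; not the image described above).

# probe URL: succeeds if any HTTP response comes back within 5 seconds.
# "Connection refused" happens at the TCP level, so no status check is done here.
probe() {
    curl --silent --output /dev/null --max-time 5 "$1"
}

# wait_for_listener URL [ATTEMPTS] [DELAY_SECONDS]
# Polls URL until probe succeeds; returns 1 if every attempt fails.
wait_for_listener() {
    url=$1
    attempts=${2:-30}
    delay=${3:-1}
    i=1
    while [ "$i" -le "$attempts" ]; do
        if probe "$url"; then
            echo "listener reachable after $i attempt(s)"
            return 0
        fi
        echo "attempt $i/$attempts failed; retrying in ${delay}s" >&2
        sleep "$delay"
        i=$((i + 1))
    done
    return 1
}

# main: wait for the controller, then hand over to the stock entrypoint.
# JENKINS_URL and the jenkins-agent path are assumptions about the container.
main() {
    wait_for_listener "${JENKINS_URL%/}/tcpSlaveAgentListener/" || exit 1
    exec /usr/local/bin/jenkins-agent "$@"
}
```

A wrapper script built from this would end with `main "$@"` and be set as the container's entrypoint. Polling has one advantage over a fixed sleep: the agent starts as soon as the network path is ready instead of always paying the full delay.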

@slide slide changed the title Slave instance fails to connect to master despite port being open Agent instance fails to connect to master despite port being open Jun 8, 2020
@abhishekkarigar

Guys,
I am facing this issue when running the Jenkins service (Windows service, 127.0.0.1:8080) outside a minikube cluster.

@yogesh9391

@aspring could you please share the details of how you built the custom image and how you added the wait?

@deepan10

Guys,
I am facing this issue when running the Jenkins service (Windows service, 127.0.0.1:8080) outside a minikube cluster.

If your slave is outside the cluster, then you have to use a NodePort to expose the master's service.
After that, you can connect the slave from outside the cluster to the master, which is inside the cluster.

@anthonyGuo

anthonyGuo commented Nov 9, 2020

I'm facing this issue! Is there any approach other than modifying the jnlp image?
istio: 1.6.8
jnlp: 4.3-4

I tried to modify the configMap for jenkins-agent by adding "sleep 10; jenkins-agent" to the command, but it did not work.
<command>sh -c "sleep 10; jenkins-agent"</command>

logs:

SEVERE: Failed to connect to https://xxxx-jenkins.xxxx.svc:8080/jenkins/tcpSlaveAgentListener/: Connection refused (Connection refused)
java.io.IOException: Failed to connect to https://xxxx-jenkins.xxxx.svc:8080/jenkins/tcpSlaveAgentListener/: Connection refused (Connection refused)
at org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver.resolve(JnlpAgentEndpointResolver.java:217)

@839928622

839928622 commented Jun 1, 2021

I'm facing the same issue on Istio 1.2.0. If you run Jenkins and the Jenkins slave on plain Kubernetes, everything works fine.

root@ubuntu:~# kubectl logs po/jenkins-slave-jrz8f -n jenkins

Warning: JnlpProtocol3 is disabled by default, use JNLP_PROTOCOL_OPTS to alter the behavior
Jun 01, 2021 1:00:07 PM hudson.remoting.jnlp.Main createEngine
INFO: Setting up agent: jenkins-slave-jrz8f
Jun 01, 2021 1:00:07 PM hudson.remoting.jnlp.Main$CuiListener <init>
INFO: Jenkins agent is running in headless mode.
Jun 01, 2021 1:00:07 PM hudson.remoting.Engine startEngine
INFO: Using Remoting version: 3.20
Jun 01, 2021 1:00:07 PM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
INFO: Using /home/jenkins/agent/remoting as a remoting work directory
Both error and output logs will be printed to /home/jenkins/agent/remoting
Jun 01, 2021 1:00:07 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Locating server among [http://jenkins.jenkins.svc.cluster.local/]
Jun 01, 2021 1:00:07 PM hudson.remoting.jnlp.Main$CuiListener error
SEVERE: Failed to connect to http://jenkins.jenkins.svc.cluster.local/tcpSlaveAgentListener/: Connection refused (Connection refused)
java.io.IOException: Failed to connect to http://jenkins.jenkins.svc.cluster.local/tcpSlaveAgentListener/: Connection refused (Connection refused)
at org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver.resolve(JnlpAgentEndpointResolver.java:192)
at hudson.remoting.Engine.innerRun(Engine.java:518)
at hudson.remoting.Engine.run(Engine.java:469)
Caused by: java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:607)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:463)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:558)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:242)
at sun.net.www.http.HttpClient.New(HttpClient.java:339)
at sun.net.www.http.HttpClient.New(HttpClient.java:357)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1226)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1162)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1056)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:990)
at org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver.resolve(JnlpAgentEndpointResolver.java:189)
... 2 more

@mb250315

I am getting the same issue

Oct 27, 2021 4:23:18 PM hudson.remoting.jnlp.Main createEngine
INFO: Setting up agent: agent-pkznn
Oct 27, 2021 4:23:18 PM hudson.remoting.jnlp.Main$CuiListener <init>
INFO: Jenkins agent is running in headless mode.
Oct 27, 2021 4:23:18 PM hudson.remoting.Engine startEngine
INFO: Using Remoting version: 4.11
Oct 27, 2021 4:23:18 PM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
INFO: Using /home/jenkins/agent/remoting as a remoting work directory
Oct 27, 2021 4:23:18 PM org.jenkinsci.remoting.engine.WorkDirManager setupLogging
INFO: Both error and output logs will be printed to /home/jenkins/agent/remoting
Oct 27, 2021 4:23:18 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Locating server among [http://fin-orchestration-jenkins-service.fssre.svc.cluster.local:8080/]
Oct 27, 2021 4:23:18 PM hudson.remoting.jnlp.Main$CuiListener error
SEVERE: Failed to connect to http://fin-orchestration-jenkins-service.fssre.svc.cluster.local:8080/tcpSlaveAgentListener/: Connection refused (Connection refused)
java.io.IOException: Failed to connect to http://fin-orchestration-jenkins-service.fssre.svc.cluster.local:8080/tcpSlaveAgentListener/: Connection refused (Connection refused)
at org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver.resolve(JnlpAgentEndpointResolver.java:214)
at hudson.remoting.Engine.innerRun(Engine.java:724)
at hudson.remoting.Engine.run(Engine.java:540)
Caused by: java.net.ConnectException: Connection refused (Connection refused)
at java.base/java.net.PlainSocketImpl.socketConnect(Native Method)
at java.base/java.net.AbstractPlainSocketImpl.doConnect(Unknown Source)
at java.base/java.net.AbstractPlainSocketImpl.connectToAddress(Unknown Source)
at java.base/java.net.AbstractPlainSocketImpl.connect(Unknown Source)
at java.base/java.net.Socket.connect(Unknown Source)
at java.base/sun.net.NetworkClient.doConnect(Unknown Source)
at java.base/sun.net.www.http.HttpClient.openServer(Unknown Source)
at java.base/sun.net.www.http.HttpClient.openServer(Unknown Source)
at java.base/sun.net.www.http.HttpClient.<init>(Unknown Source)
at java.base/sun.net.www.http.HttpClient.New(Unknown Source)
at java.base/sun.net.www.http.HttpClient.New(Unknown Source)
at java.base/sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(Unknown Source)
at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect0(Unknown Source)
at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect(Unknown Source)
at java.base/sun.net.www.protocol.http.HttpURLConnection.connect(Unknown Source)
at org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver.resolve(JnlpAgentEndpointResolver.java:211)
... 2 more

@mb250315

I can also confirm that this occurs on a GKE cluster using istio 1.4.0 but NOT on another one using an older version of istio, e.g. 1.1.15

I am getting it on istio 1.7.3 and GKE version 1.20.10-gke.301

@timja
Member

timja commented Oct 27, 2021

the recommendation appears to be to add a bit of a sleep / wait-for-it / a retry.

Happy for a fix in either this repo, or in say https://github.com/jenkinsci/remoting

cc @jeffret-b

@jmcastellote

jmcastellote commented Nov 18, 2021

Thanks @timja for the workaround. Indeed it worked for us by modifying the agent's entrypoint in k8s pod template.
Adding the following to windows jnlp agent (jenkins/inbound-agent):

    command:
    - "powershell.exe"
    args:
    - "Start-Sleep"
    - "-s"
    - "5"
    - ";"
    - "powershell.exe"
    - "-f"
    - "C:/ProgramData/Jenkins/jenkins-agent.ps1"

And it all works fine again (well... 5s slower).

Just FYI: this started happening on a new EKS 1.21 cluster with mixed ARM and AMD instances, plus Windows nodes. It only happens on the Windows nodes, which have no kube-proxy and depend on VPC webhooks, so perhaps that explains the Istio-like network behavior of the pod.

@hg13190

hg13190 commented Dec 12, 2021

How to do this if I'm not using kubernetes? How to add sleep?
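
Outside Kubernetes, the same delay-and-retry idea can live in whatever script starts the agent. A minimal sketch, assuming a POSIX shell on the agent host: `retry` is a hypothetical helper, and `java -jar agent.jar ...` stands in for whatever launch command your Jenkins node page shows for the agent.

```shell
#!/bin/sh
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it exits 0, sleeping DELAY_SECONDS between attempts.
# Returns 1 once MAX_ATTEMPTS attempts have failed.
retry() {
    max=$1
    delay=$2
    shift 2
    n=1
    until "$@"; do
        if [ "$n" -ge "$max" ]; then
            echo "giving up after $n attempt(s): $*" >&2
            return 1
        fi
        echo "attempt $n failed; sleeping ${delay}s before retrying" >&2
        sleep "$delay"
        n=$((n + 1))
    done
    return 0
}

# Example (assumed launch command; replace the arguments with your own):
#   sleep 5
#   retry 5 10 java -jar agent.jar <your usual agent arguments>
```

The initial `sleep` mirrors the fixed delay used in the pod template workaround, and the retry covers the case where that delay turns out to be too short.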

@sasha-bachurin

Updating the pod template might help as well:

spec:
  containers:
  - name: jnlp
    image: jenkins/inbound-agent:4.3-4-jdk11
    command: ["/bin/sh","-c"]
    args: ["sleep 30; /usr/local/bin/jenkins-agent"]

@psimms-r7

We are seeing similar issues, and only on Windows nodes as well.
I wonder whether we could add a readiness probe to the pod template, and if so, what would that look like?

@dduportal
Contributor

We are seeing similar issues - only for Windows nodes as well Could we add a readiness probe to the pod template I wonder, and if so what would that look like

Hi @psimms-r7 , as per https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/ , readiness probes are not usable there:

The kubelet uses readiness probes to know when a container is ready to start accepting traffic.

=> the inbound agents connect to the Jenkins controller, not the other way around. Unless you meant a readiness probe for the Jenkins controller itself in Kubernetes? (If yes, then look at the helm chart values: https://github.com/jenkinsci/helm-charts/blob/48f2acfaeec059de23d5b1065757ba8bb4621e0a/charts/jenkins/VALUES_SUMMARY.md#kubernetes-health-probes).

=> You could use a startup probe though (with a Kubernetes version supporting it): https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-startup-probes.

@psimms-r7

Apologies, you're right, something like a startup probe - could we just do a curl on the agent listener?

@psimms-r7

The error we are seeing is slightly different actually - UnknownHostException

Error

INFO: Locating server among [http://jenkins.jenkins.svc.cluster.local:8080/]
Oct 25, 2022 11:25:51 AM hudson.remoting.jnlp.Main$CuiListener error
SEVERE: Failed to connect to http://jenkins.jenkins.svc.cluster.local:8080/tcpSlaveAgentListener/: jenkins.jenkins.svc.cluster.local
java.io.IOException: Failed to connect to http://jenkins.jenkins.svc.cluster.local:8080/tcpSlaveAgentListener/: jenkins.jenkins.svc.cluster.local
	at org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver.resolve(JnlpAgentEndpointResolver.java:217)
	at hudson.remoting.Engine.innerRun(Engine.java:693)
	at hudson.remoting.Engine.run(Engine.java:518)
Caused by: java.net.UnknownHostException: jenkins.jenkins.svc.cluster.local
	at java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:220)
	at java.base/java.net.Socket.connect(Socket.java:609)
	at java.base/sun.net.NetworkClient.doConnect(NetworkClient.java:177)
	at java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:474)
	at java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:569)
	at java.base/sun.net.www.http.HttpClient.<init>(HttpClient.java:242)
	at java.base/sun.net.www.http.HttpClient.New(HttpClient.java:341)
	at java.base/sun.net.www.http.HttpClient.New(HttpClient.java:362)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1253)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1187)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1081)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:1015)
	at org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver.resolve(JnlpAgentEndpointResolver.java:214)
	... 2 more

I am experimenting with our custom inbound agent image and tweaking the jenkins-agent.ps1 script with the snippet below wrapped around the Start-Process call. This appears to have improved things. Note that I rarely use PowerShell, so I am sure this can be much improved, but would something like this make sense to merge into master?

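    $attempt = 6
    $success = $false
    while ($attempt -gt 0 -and -not $success) {
        try {
            # Connection and DNS failures throw here and are handled in the catch block.
            $Response = Invoke-WebRequest -UseBasicParsing -Uri "$env:JENKINS_URL/tcpSlaveAgentListener"
            if ($?) {
                Write-Host "AgentListener active"
                # Set the flag first so the loop ends once the agent process exits.
                $success = $true
                Start-Process -FilePath $JAVA_BIN -Wait -NoNewWindow -ArgumentList $AgentArguments
            }
            else {
                # Count this as a failed attempt too, so the loop cannot spin forever.
                Write-Host "AgentListener failed"
                $attempt--
                Start-Sleep -s 10
            }
        }
        catch {
            $attempt--
            Start-Sleep -s 10
            Write-Host "Failed"
            Write-Host $_
        }
    }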

@dduportal
Contributor

Apologies, you're right, something like a startup probe - could we just do a curl on the agent listener?

I have never played around with startup probes, but it looks like the right way to achieve this. Your idea looks really good: a startup probe that curls the Jenkins controller listener.
Alternatively, an initContainer added to the pod.

@dduportal
Contributor

The error comes from DNS resolution in your case. The UnknownHostException is pretty clear: it is NOT related to the image itself or your powershell code.

  • It could be worth checking DNS resolution with an interactive shell in your Jenkins Agent Windows pod: can it resolve an external domain such as google.com?
  • Can you confirm that your Jenkins controller is exposed by a Service named jenkins in the namespace jenkins?
  • If you have a Linux pod, can you try a Linux Jenkins agent to see if it works with the same JENKINS_URL?

=> It reminds me of microsoft/Windows-Containers#61 (if it helps)
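
The two failure modes seen in this thread point at different layers, so it can help to tell them apart before picking a workaround. A small sketch that does this from an agent log, assuming only the two message strings that appear in the logs posted here; `classify_agent_failure` is a made-up helper name, and the pod name in the usage comment is a placeholder.

```shell
#!/bin/sh
# classify_agent_failure: reads an agent log on stdin and prints one word.
#   dns      the host name did not resolve (UnknownHostException)
#   refused  the name resolved but the TCP connection was refused
#   unknown  neither message was found
classify_agent_failure() {
    log=$(cat)
    case $log in
        *UnknownHostException*) echo "dns" ;;
        *"Connection refused"*) echo "refused" ;;
        *) echo "unknown" ;;
    esac
}

# Example (placeholder pod name):
#   kubectl logs po/<agent-pod> -n jenkins | classify_agent_failure
```

A `dns` result suggests looking at name resolution inside the pod first, while `refused` matches the startup race discussed earlier in the thread, where waiting or retrying helped.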

@abcdefstar

abcdefstar commented Dec 13, 2022

Hi @psimms-r7, were you able to solve this issue? I'm facing the exact same issue in my cluster as well.

@psimms-r7

psimms-r7 commented Dec 13, 2022

Hi @psimms-r7 , Were you able to solve this issue? Im facing the exact same issue in my cluster as well

@abcdefstar

I put that snippet of code into the jenkins-agent.ps1 script and bundled it into our custom jnlp image, overwriting the original. That seems to make it more reliable; I haven't seen the issue since.

@abcdefstar

Thank you so much!! Let me give it a try..

@abcdefstar

abcdefstar commented Jun 21, 2023

@jawadqur The issue is mostly seen on Windows nodes. The init container option works for Linux.

@RiyazM3

RiyazM3 commented Nov 16, 2023

This issue occurs on an OKE cluster (v1.25.12) using Istio 1.15.1. To get it to work, I had to disable Istio for the agent namespace.

lemeurherve referenced this issue in lemeurherve/jenkinsci-docker-inbound-agent Nov 19, 2023
Fix more issues with password expiry
@lemeurherve lemeurherve transferred this issue from jenkinsci/docker-inbound-agent Jan 16, 2024
@felipecrs
Contributor

Should a retry mechanism be implemented on top of the agent.jar? In the entrypoint?

I have found references that the agent should be retried in case of failures:

https://issues.jenkins.io/browse/JENKINS-49956?focusedId=331180&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-331180

So, I wonder why docker-agent doesn't do it.

@felipecrs
Contributor

Answering my own question:

  1. Probably because the agent.jar itself is responsible for the reconnection, which is why it has a -noReconnect parameter to disable the reconnection mechanism.
  2. The reference I used is very old; perhaps back then the agent.jar was not prepared for reconnecting.
