
Many jobs based on the same agent template produce many failed deployments #102

Merged
timja merged 2 commits into jenkinsci:master from the 101-tooManyFailedDeploy branch on Feb 14, 2022

Conversation

sparsick
Contributor

  • Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  • Ensure that the pull request title represents the desired changelog entry
  • Please describe what you did
  • Link to relevant issues in GitHub or Jira
  • Link to relevant pull requests, esp. upstream and downstream changes
  • Ensure you have provided tests that demonstrate the feature works or the issue is fixed

boolean nullIsThrown;
do {
    try {
        ip = azureResourceManager.containerGroups()
Member

Could we check the IP address for null instead and then retry, rather than handling a NullPointerException?
Is it possible the IP hasn't been allocated yet? Seems quite odd, though.

Contributor Author

The NullPointerException is coming from the Resource Manager internally.

Contributor Author

java.lang.NullPointerException
        at com.azure.resourcemanager.containerinstance.implementation.ContainerGroupImpl.initializeChildrenFromInner(ContainerGroupImpl.java:217)
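
To illustrate the workaround under discussion, here is a minimal, self-contained sketch of the retry-on-NPE pattern for resolving the agent's IP. The class name, retry budget, and sleep interval are illustrative assumptions; only the containerGroups().getByResourceGroup(...) call and the nullIsThrown flag come from the diff above.

import com.azure.resourcemanager.AzureResourceManager;
import com.azure.resourcemanager.containerinstance.models.ContainerGroup;

class IpResolverSketch {
    // Keep asking Azure for the container group's IP and treat the SDK-internal
    // NullPointerException as "not ready yet" instead of failing the deployment.
    static String resolveIp(AzureResourceManager azure, String resourceGroup, String nodeName)
            throws InterruptedException {
        String ip = null;
        boolean nullIsThrown;
        int attempts = 0;                       // retry budget is an assumption
        do {
            nullIsThrown = false;
            try {
                ContainerGroup group = azure.containerGroups()
                        .getByResourceGroup(resourceGroup, nodeName);
                ip = group.ipAddress();
            } catch (NullPointerException e) {
                // Thrown inside ContainerGroupImpl.initializeChildrenFromInner while
                // the group is still provisioning (see the stack trace above); retry.
                nullIsThrown = true;
                Thread.sleep(5_000L);           // 5s back-off is an assumption
            }
        } while (nullIsThrown && ++attempts < 10);
        return ip;
    }
}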

azureResourceManager.containerGroups().getByResourceGroup(resourceGroup, agent.getNodeName());

if (containerGroup.containers().containsKey(agent.getNodeName())
        && containerGroup.containers().get(agent.getNodeName()).instanceView().currentState().state()
Member

Could we check the container for null instead and then retry, rather than handling a NullPointerException?

Contributor Author

The NullPointerException is coming from the Resource Manager internally.
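
For the waitToOnline side, a similar sketch, again hedged: the "Running" target state, the endless poll loop, and the logger setup are assumptions; the containers()/instanceView()/currentState() call chain is taken from the diff context above, and the warning text mirrors the log output quoted below.

import java.util.logging.Level;
import java.util.logging.Logger;

import com.azure.resourcemanager.AzureResourceManager;
import com.azure.resourcemanager.containerinstance.models.ContainerGroup;

class WaitToOnlineSketch {
    private static final Logger LOGGER = Logger.getLogger(WaitToOnlineSketch.class.getName());

    // Poll the container group until the agent's container reports a running state,
    // ignoring the SDK-internal NullPointerException in the same way as above.
    static void waitUntilRunning(AzureResourceManager azure, String resourceGroup, String nodeName)
            throws InterruptedException {
        while (true) {
            try {
                ContainerGroup containerGroup = azure.containerGroups()
                        .getByResourceGroup(resourceGroup, nodeName);
                if (containerGroup.containers().containsKey(nodeName)
                        && "Running".equals(containerGroup.containers().get(nodeName)
                                .instanceView().currentState().state())) {
                    return;
                }
            } catch (NullPointerException e) {
                // Same SDK-internal NPE as in addIpEnv; log it and keep waiting.
                LOGGER.log(Level.WARNING,
                        "Waiting for Agent {0} produces a NullPointerException, but it is ignored.",
                        nodeName);
            }
            Thread.sleep(5_000L);
        }
    }
}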

Co-authored-by: Tim Jacomb <[email protected]>
@timja timja added the bug label Feb 14, 2022
@timja
Member

timja commented Feb 14, 2022

Have you tested this, and does it solve the issue?

@sparsick
Contributor Author

Yes, I tested it with both kinds of container instances (private and public IP addresses), and it works as I expected.

Log output for private IP container usage:

2022-02-14 16:00:20.352+0000 [id=143]   INFO    c.m.j.c.aci.AciCloud#waitToOnline: Waiting agent test-private-qqjz2 to online
2022-02-14 16:00:20.498+0000 [id=143]   WARNING c.m.j.c.aci.AciCloud#waitToOnline: Waiting for Agent test-private-qqjz2 produces a NullPointerException, but it is ignored.
2022-02-14 16:00:20.631+0000 [id=143]   WARNING c.m.j.c.aci.AciCloud#addIpEnv: During asking for IP address of Agent test-private-qqjz2 NullPointerException is thrown,but it is ignored.
2022-02-14 16:00:25.714+0000 [id=143]   WARNING c.m.j.c.aci.AciCloud#addIpEnv: During asking for IP address of Agent test-private-qqjz2 NullPointerException is thrown,but it is ignored.
2022-02-14 16:00:28.989+0000 [id=35]    INFO    c.m.j.c.s.ContainerOnceRetentionStrategy#done: terminating test-private-lmdhk since PlaceholderExecutable:ExecutorStepExecution.PlaceholderTask{runId=test-private#69,label=test-private-lmdhk,context=CpsStepContext[3:node]:Owner[test-private/69:test-private #69],cookie=7268d723-939c-4be3-bd57-d38fd8d6e8c2,auth=null} seems to be finished
2022-02-14 16:00:28.993+0000 [id=246]   INFO    j.s.DefaultJnlpSlaveReceiver#channelClosed: Computer.threadPoolForRemoting [#94] for test-private-lmdhk terminated: java.nio.channels.ClosedChannelException
2022-02-14 16:00:29.095+0000 [id=40]    INFO    c.m.j.c.s.ContainerOnceRetentionStrategy#done: terminating test-private-qqjz2 since PlaceholderExecutable:ExecutorStepExecution.PlaceholderTask{runId=test-private#71,label=test-private-qqjz2,context=CpsStepContext[3:node]:Owner[test-private/71:test-private #71],cookie=8da4c64f-6c3a-4f1b-a6f9-d5e49fc0cd1c,auth=null} seems to be finished
2022-02-14 16:00:29.099+0000 [id=301]   INFO    c.a.c.util.logging.ClientLogger#performLogging: Azure Identity => getToken() result for scopes [https://management.core.windows.net//.default]: SUCCESS
2022-02-14 16:00:29.099+0000 [id=301]   INFO    c.a.c.util.logging.ClientLogger#info: Acquired a new access token.
2022-02-14 16:00:29.105+0000 [id=246]   INFO    j.s.DefaultJnlpSlaveReceiver#channelClosed: Computer.threadPoolForRemoting [#94] for test-private-qqjz2 terminated: java.nio.channels.ClosedChannelException
2022-02-14 16:00:29.255+0000 [id=305]   INFO    c.a.c.util.logging.ClientLogger#performLogging: Azure Identity => getToken() result for scopes [https://management.core.windows.net//.default]: SUCCESS
2022-02-14 16:00:29.255+0000 [id=305]   INFO    c.a.c.util.logging.ClientLogger#info: Acquired a new access token.
2022-02-14 16:00:29.690+0000 [id=249]   INFO    c.m.j.c.aci.AciService#deleteAciContainerGroup: Delete ACI Container Group: test-private-qqjz2 successfully

Job output:

Started by user admin
[Pipeline] Start of Pipeline
[Pipeline] node
Still waiting to schedule task
‘Jenkins’ doesn’t have label ‘test-private’
Running on test-private-qqjz2 in /home/jenkins/workspace/test-private
[Pipeline] {
[Pipeline] stage
[Pipeline] { (Hello)
[Pipeline] echo
Hello World
[Pipeline] }
[Pipeline] // stage
[Pipeline] }
[Pipeline] // node
[Pipeline] End of Pipeline
Finished: SUCCESS

@sparsick
Contributor Author

@timja I will open a new issue to remove this workaround when issue Azure/azure-sdk-for-java#27083 is fixed.

@timja timja merged commit f18b5bf into jenkinsci:master Feb 14, 2022
@sparsick sparsick deleted the 101-tooManyFailedDeploy branch February 14, 2022 16:54