
Many jobs based on the same agent template produce many failed deployments #102

Merged
timja merged 2 commits into jenkinsci:master from the 101-tooManyFailedDeploy branch on Feb 14, 2022

Conversation

sparsick
Contributor

  • Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  • Ensure that the pull request title represents the desired changelog entry
  • Please describe what you did
  • Link to relevant issues in GitHub or Jira
  • Link to relevant pull requests, esp. upstream and downstream changes
  • Ensure you have provided tests that demonstrate the feature works or the issue is fixed

boolean nullIsThrown;
do {
    try {
        ip = azureResourceManager.containerGroups()
Member

Could we check the IP address for null instead and then retry, rather than handling a NullPointerException?
Is it possible the IP hasn't been allocated yet? Seems quite odd, though.

Contributor Author

The NullPointerException is coming from the Resource Manager internally.

Contributor Author

java.lang.NullPointerException
        at com.azure.resourcemanager.containerinstance.implementation.ContainerGroupImpl.initializeChildrenFromInner(ContainerGroupImpl.java:217)
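
To illustrate the workaround under discussion, here is a minimal, self-contained sketch of the retry-on-NPE pattern for resolving the agent's IP. The class name, retry budget, and sleep interval are illustrative assumptions; only the containerGroups().getByResourceGroup(...) call and the nullIsThrown flag come from the diff above.

import com.azure.resourcemanager.AzureResourceManager;
import com.azure.resourcemanager.containerinstance.models.ContainerGroup;

class IpResolverSketch {
    // Keep asking Azure for the container group's IP and treat the SDK-internal
    // NullPointerException as "not ready yet" instead of failing the deployment.
    static String resolveIp(AzureResourceManager azure, String resourceGroup, String nodeName)
            throws InterruptedException {
        String ip = null;
        boolean nullIsThrown;
        int attempts = 0;                       // retry budget is an assumption
        do {
            nullIsThrown = false;
            try {
                ContainerGroup group = azure.containerGroups()
                        .getByResourceGroup(resourceGroup, nodeName);
                ip = group.ipAddress();
            } catch (NullPointerException e) {
                // Thrown inside ContainerGroupImpl.initializeChildrenFromInner while
                // the group is still provisioning (see the stack trace above); retry.
                nullIsThrown = true;
                Thread.sleep(5_000L);           // 5s back-off is an assumption
            }
        } while (nullIsThrown && ++attempts < 10);
        return ip;
    }
}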

azureResourceManager.containerGroups().getByResourceGroup(resourceGroup, agent.getNodeName());

if (containerGroup.containers().containsKey(agent.getNodeName())
        && containerGroup.containers().get(agent.getNodeName()).instanceView().currentState().state()
Member

Could we check the container for null instead and then retry, rather than handling a NullPointerException?

Contributor Author

The NullPointerException is coming from the Resource Manager internally.
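
For the waitToOnline side, a similar sketch, again hedged: the "Running" target state, the endless poll loop, and the logger setup are assumptions; the containers()/instanceView()/currentState() call chain is taken from the diff context above, and the warning text mirrors the log output quoted below.

import java.util.logging.Level;
import java.util.logging.Logger;

import com.azure.resourcemanager.AzureResourceManager;
import com.azure.resourcemanager.containerinstance.models.ContainerGroup;

class WaitToOnlineSketch {
    private static final Logger LOGGER = Logger.getLogger(WaitToOnlineSketch.class.getName());

    // Poll the container group until the agent's container reports a running state,
    // ignoring the SDK-internal NullPointerException in the same way as above.
    static void waitUntilRunning(AzureResourceManager azure, String resourceGroup, String nodeName)
            throws InterruptedException {
        while (true) {
            try {
                ContainerGroup containerGroup = azure.containerGroups()
                        .getByResourceGroup(resourceGroup, nodeName);
                if (containerGroup.containers().containsKey(nodeName)
                        && "Running".equals(containerGroup.containers().get(nodeName)
                                .instanceView().currentState().state())) {
                    return;
                }
            } catch (NullPointerException e) {
                // Same SDK-internal NPE as in addIpEnv; log it and keep waiting.
                LOGGER.log(Level.WARNING,
                        "Waiting for Agent {0} produces a NullPointerException, but it is ignored.",
                        nodeName);
            }
            Thread.sleep(5_000L);
        }
    }
}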

Co-authored-by: Tim Jacomb <[email protected]>
@timja timja added the bug label Feb 14, 2022
@timja
Member

timja commented Feb 14, 2022

Have you tested this, and does it solve the issue?

@sparsick
Contributor Author

Yes, I tested it with both kinds of container instances (private and public IP addresses), and it works as I expected.

Log output for private IP container usage:

2022-02-14 16:00:20.352+0000 [id=143]   INFO    c.m.j.c.aci.AciCloud#waitToOnline: Waiting agent test-private-qqjz2 to online
2022-02-14 16:00:20.498+0000 [id=143]   WARNING c.m.j.c.aci.AciCloud#waitToOnline: Waiting for Agent test-private-qqjz2 produces a NullPointerException, but it is ignored.
2022-02-14 16:00:20.631+0000 [id=143]   WARNING c.m.j.c.aci.AciCloud#addIpEnv: During asking for IP address of Agent test-private-qqjz2 NullPointerException is thrown,but it is ignored.
2022-02-14 16:00:25.714+0000 [id=143]   WARNING c.m.j.c.aci.AciCloud#addIpEnv: During asking for IP address of Agent test-private-qqjz2 NullPointerException is thrown,but it is ignored.
2022-02-14 16:00:28.989+0000 [id=35]    INFO    c.m.j.c.s.ContainerOnceRetentionStrategy#done: terminating test-private-lmdhk since PlaceholderExecutable:ExecutorStepExecution.PlaceholderTask{runId=test-private#69,label=test-private-lmdhk,context=CpsStepContext[3:node]:Owner[test-private/69:test-private #69],cookie=7268d723-939c-4be3-bd57-d38fd8d6e8c2,auth=null} seems to be finished
2022-02-14 16:00:28.993+0000 [id=246]   INFO    j.s.DefaultJnlpSlaveReceiver#channelClosed: Computer.threadPoolForRemoting [#94] for test-private-lmdhk terminated: java.nio.channels.ClosedChannelException
2022-02-14 16:00:29.095+0000 [id=40]    INFO    c.m.j.c.s.ContainerOnceRetentionStrategy#done: terminating test-private-qqjz2 since PlaceholderExecutable:ExecutorStepExecution.PlaceholderTask{runId=test-private#71,label=test-private-qqjz2,context=CpsStepContext[3:node]:Owner[test-private/71:test-private #71],cookie=8da4c64f-6c3a-4f1b-a6f9-d5e49fc0cd1c,auth=null} seems to be finished
2022-02-14 16:00:29.099+0000 [id=301]   INFO    c.a.c.util.logging.ClientLogger#performLogging: Azure Identity => getToken() result for scopes [https://management.core.windows.net//.default]: SUCCESS
2022-02-14 16:00:29.099+0000 [id=301]   INFO    c.a.c.util.logging.ClientLogger#info: Acquired a new access token.
2022-02-14 16:00:29.105+0000 [id=246]   INFO    j.s.DefaultJnlpSlaveReceiver#channelClosed: Computer.threadPoolForRemoting [#94] for test-private-qqjz2 terminated: java.nio.channels.ClosedChannelException
2022-02-14 16:00:29.255+0000 [id=305]   INFO    c.a.c.util.logging.ClientLogger#performLogging: Azure Identity => getToken() result for scopes [https://management.core.windows.net//.default]: SUCCESS
2022-02-14 16:00:29.255+0000 [id=305]   INFO    c.a.c.util.logging.ClientLogger#info: Acquired a new access token.
2022-02-14 16:00:29.690+0000 [id=249]   INFO    c.m.j.c.aci.AciService#deleteAciContainerGroup: Delete ACI Container Group: test-private-qqjz2 successfully

Job output:

Started by user admin
[Pipeline] Start of Pipeline
[Pipeline] node
Still waiting to schedule task
‘Jenkins’ doesn’t have label ‘test-private’
Running on test-private-qqjz2 in /home/jenkins/workspace/test-private
[Pipeline] {
[Pipeline] stage
[Pipeline] { (Hello)
[Pipeline] echo
Hello World
[Pipeline] }
[Pipeline] // stage
[Pipeline] }
[Pipeline] // node
[Pipeline] End of Pipeline
Finished: SUCCESS

@sparsick
Contributor Author

@timja I will open a new issue to remove this workaround when issue Azure/azure-sdk-for-java#27083 is fixed.

@timja timja merged commit f18b5bf into jenkinsci:master Feb 14, 2022
@sparsick sparsick deleted the 101-tooManyFailedDeploy branch February 14, 2022 16:54