Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Terminate the instance when 404 occured. #489

Merged
merged 8 commits into from
Dec 12, 2024

Conversation

Artmorse
Copy link
Contributor

@Artmorse Artmorse commented Nov 28, 2024

This PR is based on #470.

This wants to fix the #469 issue.

Manual testing

I've configured a GCE cloud with the specific label gcp-ce.
I've created a dummy hello-world pipeline using GCE nodes (using the label gcp-ce).
I've provisionned a node from the cloud.
Once the compute engine up and running, I ran the hello-world pipeline and it working as expected.
Then I've deleted it using the GCP UI the node is well terminated and removed from Jenkins.
I re-ran the pipeline and a new node has been provisionned.

Submitter checklist

  • Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  • Ensure that the pull request title represents the desired changelog entry
  • Please describe what you did
  • Link to relevant issues in GitHub or Jira
  • Link to relevant pull requests, esp. upstream and downstream changes
  • Ensure you have provided tests - that demonstrates feature works or fixes the issue

… Else jenkins nodes cannot be deleted if there is no corresponding instance in the cloud.
… of zombie offline nodes in jenkins that failed to start in the cloud.
// return immediately, hoping for the best.
cloud.getClient().terminateInstanceAsync(cloud.getProjectId(), zone, name);
if (instanceExistsInCloud) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this needed? Is it just to avoid an unnecessary error message in the logs in case the instance does not exist in the cloud?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is based on another branch (see #470) but I don't think this condition is useful. Theoretically we could even verify the instance exists in the cloud but when running the terminateInstanceAsync the instance could have been removed. I will remove that part.

Please @Theoderich if we're missing cases, please tell us.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done into f6fc88c

Copy link
Contributor

@Theoderich Theoderich Dec 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is part of the fix for #471 and should not be removed.
The Problem:
When Jenkins tries to start a node and the quotas in the target cloud do not allow it, the instance is not started in the cloud but Jenkins thinks that a node was started. Now there is a Zombie node in Jenkins.

When trying to delete this Zombie node, Jenkins calls cloud.getClient().terminateInstanceAsync which throws an Exception since the Instance does not exist in the cloud. The Delete is aborted and the zombie node stays, forever undeletable.

The Solution:
I added a check if the node actually exists in the cloud before calling cloud.getClient().terminateInstanceAsync. If the node does not exist as an instance in the cloud, there is no need to terminate it and the deletion can proceed.

Why not just catch the exception thrown by cloud.getClient().terminateInstanceAsync?
It is not clear from the exception why the terminate operation failed. If the instance exists in the cloud and the termination failed for some other reason, we do not want to delete the node in Jenkins.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, thanks for the detailed explanation.
I've reverted the changes.

Copy link

@jgreffe jgreffe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there some effective unit tests for this?

@Artmorse
Copy link
Contributor Author

Are there some effective unit tests for this?

I don't think we can write any unit test for this (apart from maybe using Wiremock but this seems huge...).
This repository has some integration tests but I failed to run them.

@batmat batmat added the bug Something isn't working label Dec 12, 2024
@batmat
Copy link
Member

batmat commented Dec 12, 2024

@Artmorse can you please fix the build failure? Seems like SpotBugs issues?

I just labeled the issue as bug. As one of the maintainers, I will merge this once the build goes green.

@batmat batmat merged commit 8aeebbd into jenkinsci:develop Dec 12, 2024
17 checks passed
@batmat
Copy link
Member

batmat commented Dec 12, 2024

@Artmorse merged. CD is enabled, so a release should be cut automatically within the next minutes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants