EC2 Fleet plugin terminated an AWS on-demand instance while a job was running on the node #363
Comments
Does anyone have an idea why it shows maxTotalUses (-1) for i-0d1b60f1c25b3b385 while there is a Jenkins job running on that node?
I understand that the plugin team is short on maintainers. However, we are seeing more node removals during job runs. The following is another example in which the EC2 node was terminated by the plugin while our job was running. Jenkins console output:
When I looked at the Jenkins log, it showed:
The log looks good until this line:
As I mentioned earlier in this ticket, it reported maxTotalUses (-1) while the job was still actively running. Thank you for looking into this.
We had several more cases like the above. For example, the agent was removed while the Jenkins console output showed:
…tion -rename force to overrideOtherSettings in scheduleToTerminate for clarity jenkinsci#363
…l meaning) [fix] remove misleading log and redundant set jenkinsci#363
@qibint Thanks for the details. It probably makes sense to wait to terminate an instance if there are busy executors doing work. We are looking into this option. It would be great if you could share more logs from In the meantime, we have some extra logging and fixes to add clarity.
Hi @pdk27. Thank you for the reply.
Yes, @qibint. That makes sense. We are working on that change :)
@qibint As per your logs, the last print shows maxTotalUses (-1) The plugin schedules the instance for termination if the current executor is the only active executor on the agent (using Pls ignore the misleading message in the log that shows
Okay, I see. I thought -1 means unlimited use and got confused. Thanks!
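To make the source of the confusion concrete, here is a minimal sketch of how an extra decrement can land on a sentinel value. It is not the plugin's code: `UNLIMITED`, `decrement`, and `describe` are hypothetical names, and the sketch only assumes what the thread states, namely that -1 has a special meaning and that an unnecessary decrement could produce it.

```python
# Hypothetical illustration only; these names do not come from the plugin.
UNLIMITED = -1  # sentinel value, assumed to mean "no usage limit"


def decrement(uses: int) -> int:
    """Naive bookkeeping: subtract one with no guard for the sentinel or zero."""
    return uses - 1


def describe(uses: int) -> str:
    """Interpret a usage counter the way a log reader might."""
    if uses == UNLIMITED:
        return "unlimited"
    if uses == 0:
        return "exhausted"
    return f"{uses} uses left"


configured = 1                         # Maximum Total Uses from the issue's configuration
after_job = decrement(configured)      # the single allowed use is consumed
after_extra = decrement(after_job)     # one extra, unnecessary decrement

print(describe(configured))
print(describe(after_job))
print(describe(after_extra))
```

With a limit of 1, a single extra decrement turns an exhausted counter into a value that is indistinguishable from the sentinel, which is why a log line showing -1 can be misleading.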
…und terminations triggered by plugin (#375)
* rename EC2TerminationCause to EC2ExecutorInterruptionCause for clarity
* -add and track EC2AgentTerminationReason when plugin triggers termination
  -rename force to overrideOtherSettings in scheduleToTerminate for clarity #363
* [fix] remove unnecessary decrement which could lead to -1 (has special meaning)
  [fix] remove misleading log and redundant set #363
* update and add tests
* rename overrideOtherSettings to ignoreMinConstraints
* [fix] Terminate scheduled instances ONLY IF idle #363
* [fix] leave maxTotalUses alone and track remainingUses correctly
  add a flag to track termination of agents by plugin
* [fix] Fix lost state (instanceIdsToTerminate) on configuration change
  [fix] Fix maxtotaluses decrement logic
  add logs in post job action to expose tasks terminated with problems #322
  add and fix tests
* add integration tests for configuration change leading to lost state and rebuilding lost state to terminate instances previously marked for termination
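The behaviour these commit messages describe can be sketched roughly as follows. This is an illustrative model under stated assumptions, not the plugin's implementation: the `Agent` class and its fields are hypothetical, and the choice to count a use when a job starts is an assumption made for the example. It shows the two ideas named in the commits: the configured limit is left alone while a separate remaining-uses counter is tracked, and a scheduled agent is terminated only once it is idle.

```python
from dataclasses import dataclass, field

UNLIMITED = -1  # sentinel value, assumed to mean "no usage limit"


@dataclass
class Agent:
    """Hypothetical model of an agent; not the plugin's actual class."""
    max_total_uses: int                       # configured limit, never mutated
    remaining_uses: int = field(init=False)   # separate counter that is decremented
    busy_executors: int = 0
    scheduled_for_termination: bool = False

    def __post_init__(self) -> None:
        self.remaining_uses = self.max_total_uses

    def job_started(self) -> None:
        self.busy_executors += 1
        if self.max_total_uses == UNLIMITED:
            return  # unlimited agents are never scheduled because of usage
        if self.remaining_uses > 0:
            self.remaining_uses -= 1  # guarded, so it can never reach -1
        if self.remaining_uses == 0:
            self.scheduled_for_termination = True

    def job_finished(self) -> None:
        self.busy_executors -= 1

    def should_terminate_now(self) -> bool:
        # Scheduling alone is not enough; the agent must also be idle.
        return self.scheduled_for_termination and self.busy_executors == 0


agent = Agent(max_total_uses=1)
agent.job_started()
print(agent.scheduled_for_termination, agent.should_terminate_now())
agent.job_finished()
print(agent.should_terminate_now(), agent.max_total_uses, agent.remaining_uses)
```

Separating "scheduled" from "terminate now" means a running job is never cut off by its own usage accounting, and the guarded decrement keeps the counter from colliding with the sentinel.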
Sample logs from fix:
Fixed in release ec2-fleet-2.7.0
Issue Details
Describe the bug
We have seen this issue for months: while a job is running on an EC2 on-demand instance, the EC2 Fleet plugin requests termination of the node, and the instance goes away as the ASG scales in.
Here is the job log during the interruption:
From AWS CloudTrail, we can see that the termination was initiated by the EC2 Fleet plugin:
I created an EC2RetentionStrategy log recorder for it, and it shows:
Environment Details
Plugin Version?
<2.5.2>
Jenkins Version?
<2.346.3>
Spot Fleet or ASG?
ASG
Label based fleet?
No
Linux or Windows?
EC2Fleet Configuration as Code
Max Idle Minutes Before Scaledown: 5
Minimum Cluster Size: 0
Maximum Cluster Size: 45
Minimum Spare Size: 0
Maximum Total Uses: 1 (it is odd that the log above shows i-0d1b60f1c25b3b385 due to maxTotalUses (-1))
Maximum Init Connection Timeout in sec: 180
Cloud Status Interval in sec: 10
No Delay Provision Strategy: Checked
Anything else unique about your setup?