
EC2 Fleet plugin terminated aws on demand instance when there is a job running on the node #363

Closed
qibint opened this issue Apr 14, 2023 · 10 comments · Fixed by #376

qibint commented Apr 14, 2023

Issue Details

Describe the bug
We have been seeing this issue for months: when a job is running on an EC2 on-demand instance, the EC2 Fleet plugin calls termination for the node, and the instance is removed by the ASG scale-in.

Here is the job log during the interruption:

04:58:58  dpkg-deb: building package 'xx' in 'x/packages/xx.deb'.
04:58:58  dpkg-symlink: xx_999.HEAD.999-ubuntu.999.deb
06:19:13  Cannot contact i-0d1b60f1c25b3b385: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@3ba57c9e:i-0d1b60f1c25b3b385": Remote call on i-0d1b60f1c25b3b385 failed. The channel is closing down or has closed down
06:19:13  Could not connect to i-0d1b60f1c25b3b385 to send interrupt signal to process
[Pipeline] }
[Pipeline] echo
06:19:13  [ERROR]: Build interruption error: EC2 instance for node FleetCloud i-0d1b60f1c25b3b385 was terminated

From AWS CloudTrail, we can see the termination was initiated by the EC2 Fleet plugin:

"eventTime": "2023-04-14T13:19:13Z",
    "eventSource": "ec2.amazonaws.com",
    "eventName": "TerminateInstances",
    "awsRegion": "us-west-2",
    "sourceIPAddress": "xx.xx.84.40",
    "userAgent": "ec2-fleet-plugin, aws-sdk-java/1.12.287 Linux/5.4.0-1097-aws OpenJDK_64-Bit_Server_VM/11.0.18+10-post-Ubuntu-0ubuntu118.04.1 java/11.0.18 groovy/2.4.21 vendor/Ubuntu cfg/retry-mode/legacy",
    "requestParameters": {
        "instancesSet": {
            "items": [
                {
                    "instanceId": "i-0d1b60f1c25b3b385"
                }
            ]
        }
    },

I created a log recorder for EC2RetentionStrategy, and it shows:

Checking if node 'i-0d1b60f1c25b3b385' is idle 
Apr 14, 2023 6:18:45 AM FINE com.amazon.jenkins.ec2fleet.EC2RetentionStrategy
Checking if node 'i-08339345e99f0f462' is idle 
Apr 14, 2023 6:18:45 AM INFO com.amazon.jenkins.ec2fleet.EC2RetentionStrategy isIdleForTooLong
Instance: gpu-minions i-08339345e99f0f462 Age: 64202972 Max Age:300000
Apr 14, 2023 6:18:45 AM FINE com.amazon.jenkins.ec2fleet.EC2RetentionStrategy
Checking if node 'i-0d949b615615e23e0' is idle 
Apr 14, 2023 6:19:02 AM INFO com.amazon.jenkins.ec2fleet.EC2RetentionStrategy start
Connecting to instance: FleetCloud_depr i-0132fa0351ff2f8a2 Builds left: 1 
Apr 14, 2023 6:19:10 AM INFO com.amazon.jenkins.ec2fleet.EC2RetentionStrategy postJobAction
Calling scheduleToTerminate for node i-0d1b60f1c25b3b385 due to maxTotalUses (-1)
Apr 14, 2023 6:19:13 AM FINE com.amazon.jenkins.ec2fleet.EC2RetentionStrategy
Checking if node 'i-0132fa0351ff2f8a2' is idle 

Environment Details

Plugin Version?
<2.5.2>

Jenkins Version?
<2.346.3>

Spot Fleet or ASG?
ASG

Label based fleet?
No

Linux or Windows?

EC2Fleet Configuration as Code
Max Idle Minutes Before Scaledown: 5
Minimum Cluster Size: 0
Maximum Cluster Size: 45
Minimum Spare Size: 0
Maximum Total Uses: 1 (it's weird that the log above shows i-0d1b60f1c25b3b385 terminated due to maxTotalUses (-1))
Maximum Init Connection Timeout in sec: 180
Cloud Status Interval in sec: 10
No Delay Provision Strategy: checked
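For reference, the settings above could be expressed as a Jenkins Configuration as Code sketch. This is a hedged illustration only: the `eC2Fleet` symbol and field names are assumptions based on the plugin's configurable parameters and may differ across plugin versions, and the cloud/fleet names are placeholders.

```yaml
jenkins:
  clouds:
    - eC2Fleet:                       # assumed JCasC symbol for the EC2 Fleet cloud
        name: "FleetCloud"            # placeholder cloud name
        region: "us-west-2"
        fleet: "my-asg-name"          # placeholder ASG name (ASG-backed fleet)
        idleMinutes: 5                # Max Idle Minutes Before Scaledown
        minSize: 0                    # Minimum Cluster Size
        maxSize: 45                   # Maximum Cluster Size
        minSpareSize: 0               # Minimum Spare Size
        maxTotalUses: 1               # Maximum Total Uses
        initOnlineTimeoutSec: 180     # Maximum Init Connection Timeout in sec
        cloudStatusIntervalSec: 10    # Cloud Status Interval in sec
        noDelayProvision: true        # No Delay Provision Strategy
```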

Anything else unique about your setup?

@qibint qibint added the bug label Apr 14, 2023

qibint commented May 16, 2023

Does anyone have ideas about why it shows maxTotalUses (-1) for i-0d1b60f1c25b3b385 while a Jenkins job is running on that node?


qibint commented Jun 2, 2023

I noted that the plugin team has a shortage of maintainers. However, we're experiencing more node-removal issues during job runs. Here is another example where our job was running and the EC2 node was terminated by the plugin.

Jenkins console output

16:24:28 Loading: 0 packages loaded
16:25:46 FATAL: command execution failed
16:25:46 Command Close created at
16:25:46 	at hudson.remoting.Command.<init>(Command.java:70)
..
16:25:46 Caused: java.io.IOException: Backing channel 'i-099f769bdd8ac3eb4' is disconnected.

When I looked at the Jenkins log, it showed:

2023-06-01 23:13:31.086+0000 [id=58]    INFO    c.a.j.e.EC2RetentionStrategy#start: Connecting to instance: FleetCloud_packer i-099f769bdd8ac3eb4 Builds left: 1 
2023-06-01 23:13:31.088+0000 [id=308]   INFO    c.a.j.e.EC2FleetOnlineChecker#run: No connection to node 'i-099f769bdd8ac3eb4'. Attempting to connect and waiting before retry
2023-06-01 23:13:46.088+0000 [id=308]   INFO    c.a.j.e.EC2FleetOnlineChecker#run: No connection to node 'i-099f769bdd8ac3eb4'. Attempting to connect and waiting before retry
2023-06-01 23:13:55.857+0000 [id=117]   INFO    c.a.j.e.EC2RetentionStrategy#isIdleForTooLong: Instance: FleetCloud_packer i-099f769bdd8ac3eb4 Builds left: 1  Age: 9769 Max Age:60000
2023-06-01 23:14:01.088+0000 [id=308]   INFO    c.a.j.e.EC2FleetOnlineChecker#run: No connection to node 'i-099f769bdd8ac3eb4'. Attempting to connect and waiting before retry
2023-06-01 23:14:16.089+0000 [id=308]   INFO    c.a.j.e.EC2FleetOnlineChecker#run: No connection to node 'i-099f769bdd8ac3eb4'. Attempting to connect and waiting before retry
[06/01/23 16:14:21] SSH Launch of i-099f769bdd8ac3eb4 on 172.30.197.234 completed in 50,058 ms
2023-06-01 23:14:21.194+0000 [id=2343534]       INFO    c.a.j.e.EC2RetentionStrategy#taskAccepted: maxTotalUses drained - suspending agent after current build i-099f769bdd8ac3eb4
2023-06-01 23:14:31.089+0000 [id=308]   INFO    c.a.j.e.EC2FleetOnlineChecker#run: Node 'i-099f769bdd8ac3eb4' connected. Resolving planned node
2023-06-01 23:15:23.780+0000 [id=2343826]       INFO    c.a.j.e.EC2RetentionStrategy#taskAccepted: maxTotalUses drained - suspending agent after current build i-099f769bdd8ac3eb4
2023-06-01 23:25:43.867+0000 [id=2344254]       INFO    c.a.j.e.EC2RetentionStrategy#postJobAction: Calling scheduleToTerminate for node i-099f769bdd8ac3eb4 due to maxTotalUses (-1)
2023-06-01 23:25:43.868+0000 [id=2344254]       INFO    c.a.j.ec2fleet.EC2FleetCloud#info: FleetCloud_packer [ec2-fleet-packer] Scheduling instance 'i-099f769bdd8ac3eb4' for termination on cloud com.amazon.jenkins.ec2fleet.EC2FleetCloud@4edfa205 with force: true
2023-06-01 23:25:46.009+0000 [id=2344566]       INFO    c.a.j.e.EC2FleetAutoResubmitComputerLauncher#afterDisconnect: DISCONNECTED: FleetCloud_packer i-099f769bdd8ac3eb4
2023-06-01 23:25:46.009+0000 [id=2344566]       INFO    c.a.j.e.EC2FleetAutoResubmitComputerLauncher#afterDisconnect: Skipping executable resubmission for FleetCloud_packer i-099f769bdd8ac3eb4 - disableTaskResubmit: true - offline: true
2023-06-01 23:25:46.014+0000 [id=2343785]       INFO    c.a.j.e.EC2FleetAutoResubmitComputerLauncher#afterDisconnect: DISCONNECTED: FleetCloud_packer i-099f769bdd8ac3eb4
2023-06-01 23:25:46.014+0000 [id=117]   INFO    c.a.j.ec2fleet.EC2FleetCloud#info: FleetCloud_packer [ec2-fleet-packer] Terminating nodes: [i-099f769bdd8ac3eb4, i-0d651f0f92bf1dfe8]
2023-06-01 23:25:46.014+0000 [id=2343785]       INFO    c.a.j.e.EC2FleetAutoResubmitComputerLauncher#afterDisconnect: Skipping executable resubmission for FleetCloud_packer i-099f769bdd8ac3eb4 - disableTaskResubmit: true - offline: true

The log looks good until this line:

2023-06-01 23:25:43.867+0000 [id=2344254]       INFO    c.a.j.e.EC2RetentionStrategy#postJobAction: Calling scheduleToTerminate for node i-099f769bdd8ac3eb4 due to maxTotalUses (-1)

As I mentioned in the ticket above, it reported maxTotalUses (-1) while the job was still actively running.

Thank you for looking into this,
Best


qibint commented Jun 6, 2023

We have had several more cases like the above.
I think something is wrong with the condition in this line - https://github.com/jenkinsci/ec2-plugin/blob/master/src/main/java/hudson/plugins/ec2/EC2RetentionStrategy.java#L345
How do we determine whether the agent is in a suspended state or not?

For example, the agent was removed while the Jenkins console showed:

08:44:18 ....
08:44:18 Fetching https://github.com/gbxl/flipflop.git
08:44:20 Fetching https://github.com/Shopify/oktakit.git
08:44:21 FATAL: command execution failed
08:44:21 Command Close created at
08:44:21 	at hudson.remoting.Command.<init>(Command.java:70)
...
08:44:21 Caused: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@37a393fa:i-0218c8e66ccdba530": Remote call on i-0218c8e66ccdba530 failed. The channel is closing down or has closed down

@pdk27 pdk27 self-assigned this Jun 21, 2023
pdk27 added a commit to pdk27/ec2-fleet-plugin that referenced this issue Jun 21, 2023
…tion

-rename force to overrideOtherSettings in scheduleToTerminate for clarity

jenkinsci#363
pdk27 added a commit to pdk27/ec2-fleet-plugin that referenced this issue Jun 21, 2023
…l meaning)

[fix] remove misleading log and redundant set

jenkinsci#363

pdk27 commented Jun 21, 2023

@qibint Thanks for the details.
I wasn't able to reproduce this on my end. However, from the logs it looks like you might have hit a race condition where an extra task was accepted by FleetCloud_packer i-099f769bdd8ac3eb4 (Builds left: 1) when it shouldn't have been.
You could configure a Jenkins logger for hudson.model.Executor to dig into what task was accepted and when. hudson.model.Executor's interrupt also logs a helpful stack trace that might clarify the scenario.

It probably makes sense to wait to terminate an instance while it has busy executors doing work; we are looking into that option. It would be great if you could share more logs from Executor and from the EC2FleetCloud update cycle, which looks like start cloud com.amazon.jenkins.ec2fleet.EC2FleetCloud@xxxxxx in the logs.

In the meantime, we have some extra logging and fixes to add clarity.


qibint commented Jun 22, 2023

Hi @pdk27. Thank you for the reply.
We set one executor per instance, so a race condition here would be odd. But it's a good idea to add a logger and look deeper.
Even with multiple executors and a race condition, shouldn't the instance be kept running rather than terminated, since tasks are still actively running on it?


pdk27 commented Jun 22, 2023

Yes, @qibint. That makes sense. We are working on that change :)


pdk27 commented Jun 22, 2023

@qibint As per your logs, the last line shows maxTotalUses (-1), which means maxTotalUses of 1 was decremented twice for the agent, i.e. two tasks were accepted even though your numExecutors is set to 1. Is that correct?

The plugin schedules the instance for termination if the current executor is the only active executor on the agent (using computer.countBusy()), and since this check happens in the postJobAction of the last active executor, i.e. after the actual task is done, the assumption that the instance can be terminated seems reasonable.
The only caveat I can think of is that postJobAction runs whether or not the task completed successfully. I will look into adding clarity around this in the logs.
If you can share more information about what was running or interrupted in your scenario, that would be helpful.

Please ignore the misleading log message Calling scheduleToTerminate for node i-0d1b60f1c25b3b385 due to maxTotalUses (-1). The -1 here is the result of an unnecessary decrement that happens when maxTotalUses is 1.


2023-06-01 23:13:55.857+0000 [id=117]   INFO    c.a.j.e.EC2RetentionStrategy#isIdleForTooLong: Instance: FleetCloud_packer i-099f769bdd8ac3eb4 Builds left: 1  Age: 9769 Max Age:60000
2023-06-01 23:14:21.194+0000 [id=2343534]       INFO    c.a.j.e.EC2RetentionStrategy#taskAccepted: maxTotalUses drained - suspending agent after current build i-099f769bdd8ac3eb4
2023-06-01 23:15:23.780+0000 [id=2343826]       INFO    c.a.j.e.EC2RetentionStrategy#taskAccepted: maxTotalUses drained - suspending agent after current build i-099f769bdd8ac3eb4
2023-06-01 23:25:43.867+0000 [id=2344254]       INFO    c.a.j.e.EC2RetentionStrategy#postJobAction: Calling scheduleToTerminate for node i-099f769bdd8ac3eb4 due to maxTotalUses (-1)
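The termination decision described in this comment (terminate only when the finishing executor is the last busy one on the agent and no uses remain) can be modeled with a small, self-contained sketch. This is illustrative only, not the plugin's actual code; the class, method, and parameter names are invented for the example.

```java
// Illustrative model of the post-job termination check, NOT the plugin's code.
public class TerminationGuard {
    /**
     * busyExecutors: executors still running tasks on the agent, counted
     * while the current (finishing) executor is still marked busy, as with
     * Jenkins' Computer.countBusy().
     * remainingUses: uses left for the agent after the finished task.
     */
    static boolean shouldScheduleTermination(int busyExecutors, int remainingUses) {
        // Terminate only if the finishing executor is the last busy one
        // and the agent has exhausted its allowed uses.
        return busyExecutors <= 1 && remainingUses <= 0;
    }

    public static void main(String[] args) {
        // Single-executor agent with maxTotalUses=1 finishing its only task:
        System.out.println(shouldScheduleTermination(1, 0)); // true
        // Another executor still has a build running: keep the agent.
        System.out.println(shouldScheduleTermination(2, 0)); // false
    }
}
```

The extra-decrement bug discussed above would show up here as `remainingUses` reaching -1 instead of 0; the guard still fires, which is why the fix focused on the decrement logic rather than this check.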


qibint commented Jun 23, 2023

Okay, I see. I thought -1 meant unlimited uses and got confused.
Let me keep watching and share the executor log once it happens again.

Thanks!

pdk27 added a commit that referenced this issue Jun 23, 2023
…und terminations triggered by plugin (#375)

* rename EC2TerminationCause to EC2ExecutorInterruptionCause for clarity

* -add and track EC2AgentTerminationReason when plugin triggers termination
-rename force to overrideOtherSettings in scheduleToTerminate for clarity

#363

* [fix] remove unnecessary decrement which could lead to -1 (has special meaning)

[fix] remove misleading log and redundant set

#363

* update and add tests

* rename overrideOtherSettings to ignoreMinConstraints
pdk27 added a commit to pdk27/ec2-fleet-plugin that referenced this issue Jun 27, 2023
pdk27 added a commit to pdk27/ec2-fleet-plugin that referenced this issue Jun 27, 2023
pdk27 added a commit to pdk27/ec2-fleet-plugin that referenced this issue Jun 27, 2023
pdk27 added a commit to pdk27/ec2-fleet-plugin that referenced this issue Jun 28, 2023
@pdk27 pdk27 mentioned this issue Jun 28, 2023
pdk27 added a commit that referenced this issue Jun 28, 2023
* [fix] Terminate scheduled instances ONLY IF idle

#363

* [fix] leave maxTotalUses alone and track remainingUses correctly

add a flag to track termination of agents by plugin

* [fix] Fix lost state (instanceIdsToTerminate) on configuration change

[fix] Fix maxtotaluses decrement logic

add logs in post job action to expose tasks terminated with problems

#322

add and fix tests

* add integration tests for configuration change leading to lost state and rebuilding lost state to terminate instances previously marked for termination

pdk27 commented Jun 28, 2023

Sample logs from the fix:

EC2FleetCloud info INFO: FleetCloud [label] Set target capacity to '2'
EC2FleetCloud info INFO: FleetCloud [label] Scheduling instance 'i-1' for termination on cloud com.amazon.jenkins.ec2fleet.EC2FleetCloud@22f34197 because of reason: Agent idle for too long
EC2FleetCloud info INFO: FleetCloud [label] Scheduling instance 'i-2' for termination on cloud com.amazon.jenkins.ec2fleet.EC2FleetCloud@22f34197 because of reason: MaxTotalUses exhausted for agent
EC2FleetCloud info INFO: FleetCloud [label] Scheduling instance 'i-3' for termination on cloud com.amazon.jenkins.ec2fleet.EC2FleetCloud@22f34197 because of reason: Agent deleted
EC2FleetCloud info INFO: FleetCloud [label] Skipping termination of the following instances until the next update cycle, as they are still busy doing some work: [i-2].
EC2FleetCloud info INFO: FleetCloud [label] Set target capacity to '0'
EC2FleetCloud info INFO: FleetCloud [label] Removing Jenkins nodes before terminating corresponding EC2 instances
EC2FleetCloud info INFO: FleetCloud [label] Terminated instances: {i-1=Agent idle for too long, i-3=Agent deleted}
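The behavior in these sample logs (on each update cycle, terminate scheduled instances only if idle and skip busy ones until a later cycle) can be sketched as follows. This is an illustrative model, not the plugin's code; all class, method, and variable names are invented for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Illustrative model of the fixed update cycle, NOT the plugin's code.
public class UpdateCycleSketch {
    /**
     * scheduled: instanceId -> reason the instance was scheduled for termination.
     * isBusy: whether the instance still has busy executors.
     * Returns instances to terminate now under "terminate", and instances to
     * re-check on the next update cycle under "skip".
     */
    static Map<String, List<String>> partition(Map<String, String> scheduled,
                                               Predicate<String> isBusy) {
        List<String> terminateNow = new ArrayList<>();
        List<String> keepForNextCycle = new ArrayList<>();
        for (String instanceId : scheduled.keySet()) {
            (isBusy.test(instanceId) ? keepForNextCycle : terminateNow).add(instanceId);
        }
        Map<String, List<String>> result = new HashMap<>();
        result.put("terminate", terminateNow);
        result.put("skip", keepForNextCycle);
        return result;
    }

    public static void main(String[] args) {
        // Mirrors the sample log: i-1 idle too long, i-2 uses exhausted, i-3 deleted.
        Map<String, String> scheduled = new LinkedHashMap<>();
        scheduled.put("i-1", "Agent idle for too long");
        scheduled.put("i-2", "MaxTotalUses exhausted for agent");
        scheduled.put("i-3", "Agent deleted");
        // Pretend i-2 still has a build running.
        Map<String, List<String>> r = partition(scheduled, id -> id.equals("i-2"));
        System.out.println("Terminating: " + r.get("terminate"));      // [i-1, i-3]
        System.out.println("Skipping (still busy): " + r.get("skip")); // [i-2]
    }
}
```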


pdk27 commented Jun 29, 2023

Fixed in release ec2-fleet-2.7.0
