
EC2 Fleet plugin terminated aws on demand instance when there is a job running on the node #363

Closed
qibint opened this issue Apr 14, 2023 · 10 comments · Fixed by #376

qibint commented Apr 14, 2023

Issue Details

Describe the bug
We have been seeing this issue for months: when a job is running on an EC2 on-demand instance, the EC2 Fleet plugin calls termination for the node, and the instance is removed by the ASG scale-in.

Here is the job log during the interruption:

04:58:58  dpkg-deb: building package 'xx' in 'x/packages/xx.deb'.
04:58:58  dpkg-symlink: xx_999.HEAD.999-ubuntu.999.deb
06:19:13  Cannot contact i-0d1b60f1c25b3b385: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@3ba57c9e:i-0d1b60f1c25b3b385": Remote call on i-0d1b60f1c25b3b385 failed. The channel is closing down or has closed down
06:19:13  Could not connect to i-0d1b60f1c25b3b385 to send interrupt signal to process
[Pipeline] }
[Pipeline] echo
06:19:13  [ERROR]: Build interruption error: EC2 instance for node FleetCloud i-0d1b60f1c25b3b385 was terminated

From AWS CloudTrail, we can see the termination was initiated by the EC2 Fleet plugin:

"eventTime": "2023-04-14T13:19:13Z",
    "eventSource": "ec2.amazonaws.com",
    "eventName": "TerminateInstances",
    "awsRegion": "us-west-2",
    "sourceIPAddress": "xx.xx.84.40",
    "userAgent": "ec2-fleet-plugin, aws-sdk-java/1.12.287 Linux/5.4.0-1097-aws OpenJDK_64-Bit_Server_VM/11.0.18+10-post-Ubuntu-0ubuntu118.04.1 java/11.0.18 groovy/2.4.21 vendor/Ubuntu cfg/retry-mode/legacy",
    "requestParameters": {
        "instancesSet": {
            "items": [
                {
                    "instanceId": "i-0d1b60f1c25b3b385"
                }
            ]
        }
    },

I created a log recorder for EC2RetentionStrategy, and it shows:

Checking if node 'i-0d1b60f1c25b3b385' is idle 
Apr 14, 2023 6:18:45 AM FINE com.amazon.jenkins.ec2fleet.EC2RetentionStrategy
Checking if node 'i-08339345e99f0f462' is idle 
Apr 14, 2023 6:18:45 AM INFO com.amazon.jenkins.ec2fleet.EC2RetentionStrategy isIdleForTooLong
Instance: gpu-minions i-08339345e99f0f462 Age: 64202972 Max Age:300000
Apr 14, 2023 6:18:45 AM FINE com.amazon.jenkins.ec2fleet.EC2RetentionStrategy
Checking if node 'i-0d949b615615e23e0' is idle 
Apr 14, 2023 6:19:02 AM INFO com.amazon.jenkins.ec2fleet.EC2RetentionStrategy start
Connecting to instance: FleetCloud_depr i-0132fa0351ff2f8a2 Builds left: 1 
Apr 14, 2023 6:19:10 AM INFO com.amazon.jenkins.ec2fleet.EC2RetentionStrategy postJobAction
Calling scheduleToTerminate for node i-0d1b60f1c25b3b385 due to maxTotalUses (-1)
Apr 14, 2023 6:19:13 AM FINE com.amazon.jenkins.ec2fleet.EC2RetentionStrategy
Checking if node 'i-0132fa0351ff2f8a2' is idle 

Environment Details

Plugin Version?
<2.5.2>

Jenkins Version?
<2.346.3>

Spot Fleet or ASG?
ASG

Label based fleet?
No

Linux or Windows?

EC2Fleet Configuration as Code
Max Idle Minutes Before Scaledown: 5
Minimum Cluster Size: 0
Maximum Cluster Size: 45
Minimum Spare Size: 0
Maximum Total Uses: 1 (it's weird that the log above shows i-0d1b60f1c25b3b385 terminated due to maxTotalUses (-1))
Maximum Init Connection Timeout in sec: 180
Cloud Status Interval in sec: 10
No Delay Provision Strategy: checked
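For reference, the settings above could be expressed as a Jenkins Configuration as Code sketch. This is a hedged illustration only: the `eC2Fleet` symbol and field names are assumptions based on the plugin's configurable parameters and may differ across plugin versions, and the cloud/fleet names are placeholders.

```yaml
jenkins:
  clouds:
    - eC2Fleet:                       # assumed JCasC symbol for the EC2 Fleet cloud
        name: "FleetCloud"            # placeholder cloud name
        region: "us-west-2"
        fleet: "my-asg-name"          # placeholder ASG name (ASG-backed fleet)
        idleMinutes: 5                # Max Idle Minutes Before Scaledown
        minSize: 0                    # Minimum Cluster Size
        maxSize: 45                   # Maximum Cluster Size
        minSpareSize: 0               # Minimum Spare Size
        maxTotalUses: 1               # Maximum Total Uses
        initOnlineTimeoutSec: 180     # Maximum Init Connection Timeout in sec
        cloudStatusIntervalSec: 10    # Cloud Status Interval in sec
        noDelayProvision: true        # No Delay Provision Strategy
```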

Anything else unique about your setup?

@qibint qibint added the bug label Apr 14, 2023

qibint commented May 16, 2023

Does anyone have ideas about why it shows maxTotalUses (-1) for i-0d1b60f1c25b3b385 while a Jenkins job is running on that node?


qibint commented Jun 2, 2023

I noted that the plugin team has a shortage of maintainers. However, we're experiencing more node-removal issues during job runs. Here is another example where our job was running and the EC2 node was terminated by the plugin.

Jenkins console output

16:24:28 Loading: 0 packages loaded
16:25:46 FATAL: command execution failed
16:25:46 Command Close created at
16:25:46 	at hudson.remoting.Command.<init>(Command.java:70)
..
16:25:46 Caused: java.io.IOException: Backing channel 'i-099f769bdd8ac3eb4' is disconnected.

When I looked at the Jenkins log, it showed:

2023-06-01 23:13:31.086+0000 [id=58]    INFO    c.a.j.e.EC2RetentionStrategy#start: Connecting to instance: FleetCloud_packer i-099f769bdd8ac3eb4 Builds left: 1 
2023-06-01 23:13:31.088+0000 [id=308]   INFO    c.a.j.e.EC2FleetOnlineChecker#run: No connection to node 'i-099f769bdd8ac3eb4'. Attempting to connect and waiting before retry
2023-06-01 23:13:46.088+0000 [id=308]   INFO    c.a.j.e.EC2FleetOnlineChecker#run: No connection to node 'i-099f769bdd8ac3eb4'. Attempting to connect and waiting before retry
2023-06-01 23:13:55.857+0000 [id=117]   INFO    c.a.j.e.EC2RetentionStrategy#isIdleForTooLong: Instance: FleetCloud_packer i-099f769bdd8ac3eb4 Builds left: 1  Age: 9769 Max Age:60000
2023-06-01 23:14:01.088+0000 [id=308]   INFO    c.a.j.e.EC2FleetOnlineChecker#run: No connection to node 'i-099f769bdd8ac3eb4'. Attempting to connect and waiting before retry
2023-06-01 23:14:16.089+0000 [id=308]   INFO    c.a.j.e.EC2FleetOnlineChecker#run: No connection to node 'i-099f769bdd8ac3eb4'. Attempting to connect and waiting before retry
[06/01/23 16:14:21] SSH Launch of i-099f769bdd8ac3eb4 on 172.30.197.234 completed in 50,058 ms
2023-06-01 23:14:21.194+0000 [id=2343534]       INFO    c.a.j.e.EC2RetentionStrategy#taskAccepted: maxTotalUses drained - suspending agent after current build i-099f769bdd8ac3eb4
2023-06-01 23:14:31.089+0000 [id=308]   INFO    c.a.j.e.EC2FleetOnlineChecker#run: Node 'i-099f769bdd8ac3eb4' connected. Resolving planned node
2023-06-01 23:15:23.780+0000 [id=2343826]       INFO    c.a.j.e.EC2RetentionStrategy#taskAccepted: maxTotalUses drained - suspending agent after current build i-099f769bdd8ac3eb4
2023-06-01 23:25:43.867+0000 [id=2344254]       INFO    c.a.j.e.EC2RetentionStrategy#postJobAction: Calling scheduleToTerminate for node i-099f769bdd8ac3eb4 due to maxTotalUses (-1)
2023-06-01 23:25:43.868+0000 [id=2344254]       INFO    c.a.j.ec2fleet.EC2FleetCloud#info: FleetCloud_packer [ec2-fleet-packer] Scheduling instance 'i-099f769bdd8ac3eb4' for termination on cloud com.amazon.jenkins.ec2fleet.EC2FleetCloud@4edfa205 with force: true
2023-06-01 23:25:46.009+0000 [id=2344566]       INFO    c.a.j.e.EC2FleetAutoResubmitComputerLauncher#afterDisconnect: DISCONNECTED: FleetCloud_packer i-099f769bdd8ac3eb4
2023-06-01 23:25:46.009+0000 [id=2344566]       INFO    c.a.j.e.EC2FleetAutoResubmitComputerLauncher#afterDisconnect: Skipping executable resubmission for FleetCloud_packer i-099f769bdd8ac3eb4 - disableTaskResubmit: true - offline: true
2023-06-01 23:25:46.014+0000 [id=2343785]       INFO    c.a.j.e.EC2FleetAutoResubmitComputerLauncher#afterDisconnect: DISCONNECTED: FleetCloud_packer i-099f769bdd8ac3eb4
2023-06-01 23:25:46.014+0000 [id=117]   INFO    c.a.j.ec2fleet.EC2FleetCloud#info: FleetCloud_packer [ec2-fleet-packer] Terminating nodes: [i-099f769bdd8ac3eb4, i-0d651f0f92bf1dfe8]
2023-06-01 23:25:46.014+0000 [id=2343785]       INFO    c.a.j.e.EC2FleetAutoResubmitComputerLauncher#afterDisconnect: Skipping executable resubmission for FleetCloud_packer i-099f769bdd8ac3eb4 - disableTaskResubmit: true - offline: true

The log looks good until this line:

2023-06-01 23:25:43.867+0000 [id=2344254]       INFO    c.a.j.e.EC2RetentionStrategy#postJobAction: Calling scheduleToTerminate for node i-099f769bdd8ac3eb4 due to maxTotalUses (-1)

As I mentioned in the ticket above, it reported maxTotalUses (-1) while the job was still actively running.

Thank you for looking into this,
Best


qibint commented Jun 6, 2023

We have had several more cases like the above.
I think something is wrong with the condition in this line - https://github.com/jenkinsci/ec2-plugin/blob/master/src/main/java/hudson/plugins/ec2/EC2RetentionStrategy.java#L345
How do we determine whether the agent is in a suspended state or not?

For example, the agent was removed while the Jenkins console showed:

08:44:18 ....
08:44:18 Fetching https://github.com/gbxl/flipflop.git
08:44:20 Fetching https://github.com/Shopify/oktakit.git
08:44:21 FATAL: command execution failed
08:44:21 Command Close created at
08:44:21 	at hudson.remoting.Command.<init>(Command.java:70)
...
08:44:21 Caused: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@37a393fa:i-0218c8e66ccdba530": Remote call on i-0218c8e66ccdba530 failed. The channel is closing down or has closed down

@pdk27 pdk27 self-assigned this Jun 21, 2023
pdk27 added a commit to pdk27/ec2-fleet-plugin that referenced this issue Jun 21, 2023
…tion

-rename force to overrideOtherSettings in scheduleToTerminate for clarity

jenkinsci#363
pdk27 added a commit to pdk27/ec2-fleet-plugin that referenced this issue Jun 21, 2023
…l meaning)

[fix] remove misleading log and redundant set

jenkinsci#363

pdk27 commented Jun 21, 2023

@qibint Thanks for the details.
I wasn't able to reproduce this on my end. However, from the logs it looks like you might have hit a race condition where an extra task was accepted by FleetCloud_packer i-099f769bdd8ac3eb4 (Builds left: 1) when it shouldn't have been.
You could configure a Jenkins logger for hudson.model.Executor to dig into what task was accepted and when. hudson.model.Executor's interrupt also logs a helpful stack trace that might clarify the scenario.

It probably makes sense to wait to terminate an instance while it has busy executors doing work; we are looking into that option. It would be great if you could share more logs from Executor and from the EC2FleetCloud update cycle, which looks like start cloud com.amazon.jenkins.ec2fleet.EC2FleetCloud@xxxxxx in the logs.

In the meantime, we have some extra logging and fixes to add clarity.


qibint commented Jun 22, 2023

Hi @pdk27. Thank you for the reply.
We set one executor per instance, so a race condition here would be odd. But it's a good idea to add a logger and look deeper.
Even with multiple executors and a race condition, shouldn't the instance be kept running rather than terminated, since tasks are still actively running on it?


pdk27 commented Jun 22, 2023

Yes, @qibint. That makes sense. We are working on that change :)


pdk27 commented Jun 22, 2023

@qibint As per your logs, the last line shows maxTotalUses (-1), which means maxTotalUses of 1 was decremented twice for the agent, i.e. two tasks were accepted even though your numExecutors is set to 1. Is that correct?

The plugin schedules the instance for termination if the current executor is the only active executor on the agent (using computer.countBusy()), and since this check happens in the postJobAction of the last active executor, i.e. after the actual task is done, the assumption that the instance can be terminated seems reasonable.
The only caveat I can think of is that postJobAction runs whether or not the task completed successfully. I will look into adding clarity around this in the logs.
If you can share more information about what was running or interrupted in your scenario, that would be helpful.

Please ignore the misleading log message Calling scheduleToTerminate for node i-0d1b60f1c25b3b385 due to maxTotalUses (-1). The -1 here is the result of an unnecessary decrement that happens when maxTotalUses is 1.


2023-06-01 23:13:55.857+0000 [id=117]   INFO    c.a.j.e.EC2RetentionStrategy#isIdleForTooLong: Instance: FleetCloud_packer i-099f769bdd8ac3eb4 Builds left: 1  Age: 9769 Max Age:60000
2023-06-01 23:14:21.194+0000 [id=2343534]       INFO    c.a.j.e.EC2RetentionStrategy#taskAccepted: maxTotalUses drained - suspending agent after current build i-099f769bdd8ac3eb4
2023-06-01 23:15:23.780+0000 [id=2343826]       INFO    c.a.j.e.EC2RetentionStrategy#taskAccepted: maxTotalUses drained - suspending agent after current build i-099f769bdd8ac3eb4
2023-06-01 23:25:43.867+0000 [id=2344254]       INFO    c.a.j.e.EC2RetentionStrategy#postJobAction: Calling scheduleToTerminate for node i-099f769bdd8ac3eb4 due to maxTotalUses (-1)
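The termination decision described in this comment (terminate only when the finishing executor is the last busy one on the agent and no uses remain) can be modeled with a small, self-contained sketch. This is illustrative only, not the plugin's actual code; the class, method, and parameter names are invented for the example.

```java
// Illustrative model of the post-job termination check, NOT the plugin's code.
public class TerminationGuard {
    /**
     * busyExecutors: executors still running tasks on the agent, counted
     * while the current (finishing) executor is still marked busy, as with
     * Jenkins' Computer.countBusy().
     * remainingUses: uses left for the agent after the finished task.
     */
    static boolean shouldScheduleTermination(int busyExecutors, int remainingUses) {
        // Terminate only if the finishing executor is the last busy one
        // and the agent has exhausted its allowed uses.
        return busyExecutors <= 1 && remainingUses <= 0;
    }

    public static void main(String[] args) {
        // Single-executor agent with maxTotalUses=1 finishing its only task:
        System.out.println(shouldScheduleTermination(1, 0)); // true
        // Another executor still has a build running: keep the agent.
        System.out.println(shouldScheduleTermination(2, 0)); // false
    }
}
```

The extra-decrement bug discussed above would show up here as `remainingUses` reaching -1 instead of 0; the guard still fires, which is why the fix focused on the decrement logic rather than this check.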


qibint commented Jun 23, 2023

Okay, I see. I thought -1 meant unlimited uses and got confused.
Let me keep watching and share the executor log once it happens again.

Thanks!

pdk27 added a commit that referenced this issue Jun 23, 2023
…und terminations triggered by plugin (#375)

* rename EC2TerminationCause to EC2ExecutorInterruptionCause for clarity

* -add and track EC2AgentTerminationReason when plugin triggers termination
-rename force to overrideOtherSettings in scheduleToTerminate for clarity

#363

* [fix] remove unnecessary decrement which could lead to -1 (has special meaning)

[fix] remove misleading log and redundant set

#363

* update and add tests

* rename overrideOtherSettings to ignoreMinConstraints
pdk27 added a commit to pdk27/ec2-fleet-plugin that referenced this issue Jun 27, 2023
pdk27 added a commit to pdk27/ec2-fleet-plugin that referenced this issue Jun 27, 2023
pdk27 added a commit to pdk27/ec2-fleet-plugin that referenced this issue Jun 27, 2023
pdk27 added a commit to pdk27/ec2-fleet-plugin that referenced this issue Jun 28, 2023
@pdk27 pdk27 mentioned this issue Jun 28, 2023
pdk27 added a commit that referenced this issue Jun 28, 2023
* [fix] Terminate scheduled instances ONLY IF idle

#363

* [fix] leave maxTotalUses alone and track remainingUses correctly

add a flag to track termination of agents by plugin

* [fix] Fix lost state (instanceIdsToTerminate) on configuration change

[fix] Fix maxtotaluses decrement logic

add logs in post job action to expose tasks terminated with problems

#322

add and fix tests

* add integration tests for configuration change leading to lost state and rebuilding lost state to terminate instances previously marked for termination

pdk27 commented Jun 28, 2023

Sample logs from the fix:

EC2FleetCloud info INFO: FleetCloud [label] Set target capacity to '2'
EC2FleetCloud info INFO: FleetCloud [label] Scheduling instance 'i-1' for termination on cloud com.amazon.jenkins.ec2fleet.EC2FleetCloud@22f34197 because of reason: Agent idle for too long
EC2FleetCloud info INFO: FleetCloud [label] Scheduling instance 'i-2' for termination on cloud com.amazon.jenkins.ec2fleet.EC2FleetCloud@22f34197 because of reason: MaxTotalUses exhausted for agent
EC2FleetCloud info INFO: FleetCloud [label] Scheduling instance 'i-3' for termination on cloud com.amazon.jenkins.ec2fleet.EC2FleetCloud@22f34197 because of reason: Agent deleted
EC2FleetCloud info INFO: FleetCloud [label] Skipping termination of the following instances until the next update cycle, as they are still busy doing some work: [i-2].
EC2FleetCloud info INFO: FleetCloud [label] Set target capacity to '0'
EC2FleetCloud info INFO: FleetCloud [label] Removing Jenkins nodes before terminating corresponding EC2 instances
EC2FleetCloud info INFO: FleetCloud [label] Terminated instances: {i-1=Agent idle for too long, i-3=Agent deleted}
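The behavior in these sample logs (on each update cycle, terminate scheduled instances only if idle and skip busy ones until a later cycle) can be sketched as follows. This is an illustrative model, not the plugin's code; all class, method, and variable names are invented for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Illustrative model of the fixed update cycle, NOT the plugin's code.
public class UpdateCycleSketch {
    /**
     * scheduled: instanceId -> reason the instance was scheduled for termination.
     * isBusy: whether the instance still has busy executors.
     * Returns instances to terminate now under "terminate", and instances to
     * re-check on the next update cycle under "skip".
     */
    static Map<String, List<String>> partition(Map<String, String> scheduled,
                                               Predicate<String> isBusy) {
        List<String> terminateNow = new ArrayList<>();
        List<String> keepForNextCycle = new ArrayList<>();
        for (String instanceId : scheduled.keySet()) {
            (isBusy.test(instanceId) ? keepForNextCycle : terminateNow).add(instanceId);
        }
        Map<String, List<String>> result = new HashMap<>();
        result.put("terminate", terminateNow);
        result.put("skip", keepForNextCycle);
        return result;
    }

    public static void main(String[] args) {
        // Mirrors the sample log: i-1 idle too long, i-2 uses exhausted, i-3 deleted.
        Map<String, String> scheduled = new LinkedHashMap<>();
        scheduled.put("i-1", "Agent idle for too long");
        scheduled.put("i-2", "MaxTotalUses exhausted for agent");
        scheduled.put("i-3", "Agent deleted");
        // Pretend i-2 still has a build running.
        Map<String, List<String>> r = partition(scheduled, id -> id.equals("i-2"));
        System.out.println("Terminating: " + r.get("terminate"));      // [i-1, i-3]
        System.out.println("Skipping (still busy): " + r.get("skip")); // [i-2]
    }
}
```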


pdk27 commented Jun 29, 2023

Fixed in release ec2-fleet-2.7.0
