Maintenance window's "NextExecutionTime" is updated as soon as execution begins, causing instances to be shut down at the next scheduler interval #101

Closed
georgematthew opened this issue Jun 26, 2019 · 11 comments

Comments

@georgematthew

After working around #99 and #100, I am still unable to use the SSM maintenance window functionality. I've attempted to outline the behavior I am seeing below. Please let me know if I can elaborate on anything.

  1. The instances that are configured with a schedule that references a maintenance window are started at least 10 minutes before the maintenance window based on the schedule/period created from the maintenance window's NextExecutionTime. The running period is 2 hours in duration, as expected. This matches the maintenance window duration.

  2. The SSM maintenance window tasks begin. By this time the instances are running and recognized by SSM. I am executing Run Command tasks to run the AWS-UpdateSSMAgent and AWS-RunPatchBaseline documents.

  3. At the next scheduler interval (10 minutes later, for example), the instances are stopped because the scheduler has created a new schedule/period based on the maintenance window's updated NextExecutionTime. It appears that the previously created period/schedule is overwritten and the scheduler believes that the desired state is "stopped". In my case, the NextExecutionTime is one week in the future, as the maintenance window is scheduled once per week. This causes the pending Run Command tasks to fail and tasks that have yet to start to report NoInstancesInTag.

The expected behavior is that the scheduler would keep the instances running for the duration of the maintenance window.

Is this a bug in the scheduler's maintenance window functionality or am I failing to understand something about how this solution is intended to be used?

@georgebearden
Member

Hi George - Again, sorry for the delay on this :) Let me get a test environment set up so I can run through this scenario specifically, and then update this issue with findings.

@georgematthew
Author

Hi George, have you had a chance to test this?

@tapughose

@georgematthew, I tried to reproduce the issue, but I could only do so when ssm_maintenance_window and use_maintenance_window were set up incorrectly. For the scheduler to honor a maintenance window, ssm_maintenance_window must be set to the maintenance window's name, and use_maintenance_window must be set to true.

Here is a working example of a schedule item from my test table in DynamoDB:

{
  "name": {
    "S": "test-schedule"
  },
  "periods": {
    "SS": [
      "test-period"
    ]
  },
  "ssm_maintenance_window": {
    "S": "test-ssm-mw"
  },
  "timezone": {
    "S": "UTC"
  },
  "type": {
    "S": "schedule"
  },
  "use_maintenance_window": {
    "BOOL": true
  }
}

The test-period referenced in periods looks as follows:

{
  "begintime": {
    "S": "6:00"
  },
  "endtime": {
    "S": "17:00"
  },
  "name": {
    "S": "test-period"
  },
  "type": {
    "S": "period"
  },
  "weekdays": {
    "SS": [
      "mon-sun"
    ]
  }
}

The instance scheduler first checks whether the instance has a maintenance window (here, test-ssm-mw) during which it must be running. If not, the scheduler then checks the conditions for the period (here, test-period).
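To make that order of evaluation concrete, here is a minimal sketch in Python. It is not the scheduler's actual code: the function name desired_state and the tuple-based window and period representations are assumptions made purely for illustration.

```python
from datetime import datetime, time

def desired_state(now, maintenance_window, periods):
    """Illustrative only (hypothetical helper, not the scheduler's code).

    maintenance_window: (start, end) datetimes, or None if the schedule has none.
    periods: list of (begintime, endtime) as datetime.time values.
    """
    # 1. The maintenance window is checked first.
    if maintenance_window is not None:
        start, end = maintenance_window
        if start <= now < end:
            return "running"
    # 2. Otherwise fall back to the regular periods of the schedule.
    for begin, end in periods:
        if begin <= now.time() < end:
            return "running"
    return "stopped"

# test-period from above: 6:00 to 17:00 every day
test_period = [(time(6, 0), time(17, 0))]
# Hypothetical maintenance window outside of the regular period
window = (datetime(2019, 10, 16, 22, 0), datetime(2019, 10, 17, 0, 0))

print(desired_state(datetime(2019, 10, 16, 23, 0), window, test_period))
print(desired_state(datetime(2019, 10, 16, 12, 0), window, test_period))
print(desired_state(datetime(2019, 10, 16, 18, 0), window, test_period))
```

In this sketch the instance is "running" at 23:00 only because of the maintenance window, "running" at 12:00 because of test-period, and "stopped" at 18:00 because neither applies.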

Could you confirm that you have set both the ssm_maintenance_window and use_maintenance_window properties as outlined?

@georgebearden
Member

This issue should now be resolved. Please let us know if this is not the case.

@georgematthew
Author

@tapughose And the scheduler kept your instance running for the duration of the maintenance window? This issue is still occurring for me in v1.3. I'm seeing the same behavior that I describe above.

Here is my schedule:

{
  "description": "Keep instances off except for the maintenance window.",
  "enforced": true,
  "name": "AlwaysOff",
  "periods": [
    "AlwaysOff"
  ],
  "ssm_maintenance_window": "test-maintenance-window",
  "type": "schedule",
  "use_maintenance_window": true
}

and the period:

{
  "description": "Keep instances off.",
  "endtime": "00:00",
  "name": "AlwaysOff",
  "type": "period"
}

and I've pasted the scheduler logs below, where you can see that the scheduler successfully detects the maintenance window and starts the instance at 18:20. When the scheduler runs again at 18:25, it leaves the instance in the running state. When the scheduler runs at 18:30, which is the start of the 2hr maintenance window, it shuts the instance down while the maintenance window is still InProgress. Based on the logs, the scheduler has created a new running period for the next execution of the maintenance window (tomorrow).

2019-10-16 - 18:20:19.716 - INFO : Handler SchedulerRequestHandler scheduling request for service(s) ec2, account(s) [xxxxxxxxxxx], region(s) us-east-1 at 2019-10-16 18:20:19.716431
2019-10-16 - 18:20:19.934 - INFO : Running EC2 scheduler for account [xxxxxxxxxxx] in region(s) us-east-1
2019-10-16 - 18:20:20.694 - INFO : Fetching ec2 instances for account [xxxxxxxxxxx] in region us-east-1
2019-10-16 - 18:20:21.489 - INFO : Created schedule test-maintenance-window from SSM maintence window, start is 2019-10-16T14:20:00-04:00, end is 2019-10-16T16:30:00-04:00
2019-10-16 - 18:20:21.489 - INFO : SSM maintenance window disabled (mw-[xxxxxxxxxxx]) is disabled
2019-10-16 - 18:20:21.490 - DEBUG : Selected ec2 instance i-[xxxxxxxxxxx] in state (stopped)
2019-10-16 - 18:20:21.490 - INFO : Number of fetched ec2 instances is 1, number of instances in a schedulable state is 1
2019-10-16 - 18:20:21.751 - DEBUG : [ Instance EC2:i-[xxxxxxxxxxx] (Test) ]
2019-10-16 - 18:20:21.751 - DEBUG : Current state is stopped, instance type is t2.micro, schedule is "AlwaysOff"
2019-10-16 - 18:20:21.751 - INFO : Maintenance window "test-maintenance-window" used as running period found for instance i-[xxxxxxxxxxx]
2019-10-16 - 18:20:21.752 - DEBUG : Time used to determine desired for instance is Wed Oct 16 14:20:21 2019
2019-10-16 - 18:20:21.752 - DEBUG : Checking conditions for period "test-maintenance-window-period"
2019-10-16 - 18:20:21.752 - DEBUG : [running] Month "oct" in months (oct)
2019-10-16 - 18:20:21.752 - DEBUG : [running] Day of month 16 in month days (16)
2019-10-16 - 18:20:21.752 - DEBUG : [running] Time 14:20:21 is within 14:20:00-16:30:00, returned state is running
2019-10-16 - 18:20:21.752 - DEBUG : Active period in schedule "test-maintenance-window": "test-maintenance-window-period"
2019-10-16 - 18:20:21.752 - DEBUG : Desired state for instance from schedule "AlwaysOff" is running, last desired state was stopped, actual state is stopped
2019-10-16 - 18:20:21.752 - DEBUG : Using enforcement flag of schedule to set actual state of instance EC2:i-[xxxxxxxxxxx] (Test) from stopped to running
2019-10-16 - 18:20:21.752 - DEBUG : Listing instance EC2:i-[xxxxxxxxxxx] (Test) in region us-east-1 with instance type t2.micro to be started by scheduler
2019-10-16 - 18:20:21.752 - INFO : Starting instances EC2:i-[xxxxxxxxxxx] (Test) in region us-east-1
2019-10-16 - 18:20:22.534 - INFO : Scheduler result {'[xxxxxxxxxxx]': {'started': {'us-east-1': [{'i-[xxxxxxxxxxx]': {'schedule': 'AlwaysOff'}}]}, 'stopped': {}, 'resized': {}}}
2019-10-16 - 18:25:19.630 - INFO : Handler SchedulerRequestHandler scheduling request for service(s) ec2, account(s) [xxxxxxxxxxx], region(s) us-east-1 at 2019-10-16 18:25:19.630564
2019-10-16 - 18:25:19.849 - INFO : Running EC2 scheduler for account [xxxxxxxxxxx] in region(s) us-east-1
2019-10-16 - 18:25:20.488 - INFO : Fetching ec2 instances for account [xxxxxxxxxxx] in region us-east-1
2019-10-16 - 18:25:21.200 - INFO : Created schedule test-maintenance-window from SSM maintence window, start is 2019-10-16T14:20:00-04:00, end is 2019-10-16T16:30:00-04:00
2019-10-16 - 18:25:21.200 - INFO : SSM maintenance window disabled (mw-[xxxxxxxxxxx]) is disabled
2019-10-16 - 18:25:21.202 - DEBUG : Selected ec2 instance i-[xxxxxxxxxxx] in state (running)
2019-10-16 - 18:25:21.202 - INFO : Number of fetched ec2 instances is 1, number of instances in a schedulable state is 1
2019-10-16 - 18:25:21.469 - DEBUG : [ Instance EC2:i-[xxxxxxxxxxx] (Test) ]
2019-10-16 - 18:25:21.469 - DEBUG : Current state is running, instance type is t2.micro, schedule is "AlwaysOff"
2019-10-16 - 18:25:21.469 - INFO : Maintenance window "test-maintenance-window" used as running period found for instance i-[xxxxxxxxxxx]
2019-10-16 - 18:25:21.469 - DEBUG : Time used to determine desired for instance is Wed Oct 16 14:25:21 2019
2019-10-16 - 18:25:21.469 - DEBUG : Checking conditions for period "test-maintenance-window-period"
2019-10-16 - 18:25:21.469 - DEBUG : [running] Month "oct" in months (oct)
2019-10-16 - 18:25:21.469 - DEBUG : [running] Day of month 16 in month days (16)
2019-10-16 - 18:25:21.469 - DEBUG : [running] Time 14:25:21 is within 14:20:00-16:30:00, returned state is running
2019-10-16 - 18:25:21.469 - DEBUG : Active period in schedule "test-maintenance-window": "test-maintenance-window-period"
2019-10-16 - 18:25:21.469 - DEBUG : Desired state for instance from schedule "AlwaysOff" is running, last desired state was running, actual state is running
2019-10-16 - 18:25:21.469 - INFO : Scheduler result {'[xxxxxxxxxxx]': {'started': {}, 'stopped': {}, 'resized': {}}}
2019-10-16 - 18:30:20.476 - INFO : Handler SchedulerRequestHandler scheduling request for service(s) ec2, account(s) [xxxxxxxxxxx], region(s) us-east-1 at 2019-10-16 18:30:20.476061
2019-10-16 - 18:30:20.693 - INFO : Running EC2 scheduler for account [xxxxxxxxxxx] in region(s) us-east-1
2019-10-16 - 18:30:21.234 - INFO : Fetching ec2 instances for account [xxxxxxxxxxx] in region us-east-1
2019-10-16 - 18:30:21.888 - INFO : Created schedule test-maintenance-window from SSM maintence window, start is 2019-10-17T14:20:00-04:00, end is 2019-10-17T16:30:00-04:00
2019-10-16 - 18:30:21.888 - INFO : SSM maintenance window disabled (mw-[xxxxxxxxxxx]) is disabled
2019-10-16 - 18:30:21.894 - DEBUG : Selected ec2 instance i-[xxxxxxxxxxx] in state (running)
2019-10-16 - 18:30:21.894 - INFO : Number of fetched ec2 instances is 1, number of instances in a schedulable state is 1
2019-10-16 - 18:30:22.155 - DEBUG : [ Instance EC2:i-[xxxxxxxxxxx] (Test) ]
2019-10-16 - 18:30:22.155 - DEBUG : Current state is running, instance type is t2.micro, schedule is "AlwaysOff"
2019-10-16 - 18:30:22.155 - INFO : Maintenance window "test-maintenance-window" used as running period found for instance i-[xxxxxxxxxxx]
2019-10-16 - 18:30:22.155 - DEBUG : Time used to determine desired for instance is Wed Oct 16 14:30:22 2019
2019-10-16 - 18:30:22.155 - DEBUG : Checking conditions for period "test-maintenance-window-period"
2019-10-16 - 18:30:22.155 - DEBUG : [running] Month "oct" in months (oct)
2019-10-16 - 18:30:22.155 - DEBUG : [stopped] Day of month 16 not in month days (17)
2019-10-16 - 18:30:22.155 - DEBUG : No running periods at this time found in schedule "test-maintenance-window" for this time, desired state is stopped
2019-10-16 - 18:30:22.155 - DEBUG : Time used to determine desired for instance is Wed Oct 16 14:30:17 2019
2019-10-16 - 18:30:22.155 - DEBUG : Checking conditions for period "AlwaysOff"
2019-10-16 - 18:30:22.155 - DEBUG : [stopped] Time 14:30:17 is after stoptime 00:00:00, returned state is stopped
2019-10-16 - 18:30:22.155 - DEBUG : No running periods at this time found in schedule "AlwaysOff" for this time, desired state is stopped
2019-10-16 - 18:30:22.155 - DEBUG : Desired state for instance from schedule "AlwaysOff" is stopped, last desired state was running, actual state is running
2019-10-16 - 18:30:22.155 - DEBUG : Using enforcement flag of schedule to set actual state of instance EC2:i-[xxxxxxxxxxx] (Test) from running to stopped
2019-10-16 - 18:30:22.155 - DEBUG : Listing instance EC2:i-[xxxxxxxxxxx] (Test) in region us-east-1 to be stopped by scheduler
2019-10-16 - 18:30:22.155 - INFO : Stopping instances EC2:i-[xxxxxxxxxxx] (Test) in region us-east-1
2019-10-16 - 18:30:22.654 - INFO : Scheduler result {'[xxxxxxxxxxx]': {'started': {}, 'stopped': {'us-east-1': [{'i-[xxxxxxxxxxx]': {'schedule': 'AlwaysOff'}}]}, 'resized': {}}}

Is there a check that is done to see if the ssm_maintenance_window that is defined in the schedule is currently running? The scheduler seems to only be taking the NextExecutionTime of the maintenance window and not its current execution.
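To illustrate what I mean, here is a rough sketch in Python of the behavior in the logs and of one possible guard. It is not the scheduler's code and not necessarily how a fix would be implemented: the function names, the 10-minute lead, and the idea of remembering the previously derived window are all my assumptions. The times are taken from the logs above (local time, UTC-4).

```python
from datetime import datetime, timedelta

LEAD = timedelta(minutes=10)      # assumed: instances start 10 minutes early
DURATION = timedelta(hours=2)     # maintenance window duration

def window_from_next_execution(next_execution):
    """Derive a running window purely from NextExecutionTime (hypothetical helper)."""
    return (next_execution - LEAD, next_execution + DURATION)

def effective_window(now, remembered, next_execution):
    """Keep a previously derived window while it is still open (assumed guard)."""
    if remembered is not None and remembered[0] <= now < remembered[1]:
        return remembered
    return window_from_next_execution(next_execution)

def is_inside(now, window):
    return window[0] <= now < window[1]

# 14:20 and 14:25 runs: NextExecutionTime is still today at 14:30.
todays_window = window_from_next_execution(datetime(2019, 10, 16, 14, 30))

# 14:30 run: execution has begun, so NextExecutionTime now points to tomorrow.
now = datetime(2019, 10, 16, 14, 30, 22)
next_execution = datetime(2019, 10, 17, 14, 30)

# Using only NextExecutionTime: the instance falls outside the window.
print(is_inside(now, window_from_next_execution(next_execution)))

# Remembering the open window: the instance stays inside it until 16:30.
print(is_inside(now, effective_window(now, todays_window, next_execution)))
```

The first check prints False, which matches the "stopped" decision at 18:30 UTC in the logs; the second prints True, which is the behavior I would expect while the window is InProgress.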

@tapughose

@georgematthew, yes, the scheduler kept my instance running for the duration of the maintenance window. I will create a schedule and period like yours and see if I can find something.

@georgematthew
Author

@tapughose I tried blowing away the scheduler stack and redeploying with the newest version, no luck. I experienced the same behavior that I describe above.

@georgebearden I don't yet consider this issue resolved. The instance scheduler is not keeping instances running for the duration of the maintenance window when configured with the schedule/period I have posted above. I would consider this issue resolved if someone is able to point to an error in the configuration I have posted above or an update is released that resolves the issue with the provided configuration.

Please let me know if I can provide any additional debugging information. Thank you.

@hoppalotta

I can confirm I am seeing the exact same behavior as @georgematthew.
Using 1.3 and a schedule that is effectively always off other than the ssm maintenance period.

@hross-frae

I am also experiencing the same problem: the scheduler turns off an instance immediately after it has been started for a maintenance window.

@chaitand28

This issue has been fixed in release 1.3.1. Please deploy the latest template to get the updated code.

@mahammadism

mahammadism commented Jan 5, 2021

Hi,
I am working with Instance Scheduler and SSM maintenance windows for the first time. My requirement is to enable the SSM maintenance window for Instance Scheduler and to start instances 2 hours before the SSM maintenance window task execution.
Please guide me on how to enable the SSM maintenance window in Instance Scheduler using CloudFormation.

Thanks,
Ismail. S
