
[openshift] add support for recycling StatefulSet #1627

Merged · 21 commits · Jun 14, 2021

Conversation

@maorfr (Contributor) commented Jun 10, 2021

part of https://issues.redhat.com/browse/APPSRE-3351

Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', and 'updateStrategy' are forbidden

This PR adds support for applying any additional changes to StatefulSets without incurring downtime.

When we detect a "fields immutable" error for a StatefulSet, we do the following (see the sketch after this list):

  1. delete the StatefulSet with cascade=false so that its pods are not deleted
  2. apply the updated StatefulSet
  3. delete one pod at a time and wait for the new StatefulSet to bring it back up
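For illustration, a minimal sketch of that flow in Python. The helper names and call signatures here (a delete with a cascade flag, get_owned_pods, validate_pod_ready) are assumptions based on this description, not necessarily the exact code in this PR:

def recycle_statefulset(oc, namespace, resource):
    # 1. Delete the StatefulSet without cascading, so its pods survive
    #    as orphans of the old object.
    oc.delete(namespace, 'StatefulSet', resource.name, cascade=False)
    # 2. Apply the updated StatefulSet; it is immediately satisfied because
    #    the expected number of pods already exists.
    oc.apply(namespace, resource)
    # 3. Delete the pods one at a time; the new StatefulSet recreates each
    #    one, and we wait until the replacement reports ready.
    for pod in oc.get_owned_pods(namespace, resource):
        name = pod['metadata']['name']
        oc.delete(namespace, 'Pod', name)
        oc.validate_pod_ready(namespace, name)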

maorfr requested review from rporres, jmelis and apahim on June 10, 2021 08:50
@apahim (Contributor) commented Jun 10, 2021

I don't see tests, yet CI still seems happy. Is that expected? I saw some discussion about not reducing coverage, and this clearly reduces it. Are we in a grace period or something?

@rporres (Contributor) left a comment

Doing this with StatefulSets scares the **** out of me, but this approach is cool. The new SS doesn't do anything as it already has the number of pods it needs. Great.

I would add a good chunk of documentation (basically what you have in the description)

Another thing I don't know how to fix is pods taking a long time to get deleted. I remember Telemeter pods taking a long time to get restarted because they needed to wait a while to upload something. If that's something that is in the Pod spec, we should at least wait that long before declaring the Pod not ready...

@maorfr (Contributor, Author) commented Jun 10, 2021

@rporres do note that when we apply the new STS, it is not happy. This is why we need to delete the pods one by one. After each deletion, the new STS will kick in and create the pod (we are just deleting an orphaned one from the previous STS).

About pods that take a long time to be deleted: this is why I use 20 as the max attempts for the pod ready validation. 20 attempts is 210 seconds of backoff (1 + 2 + ... + 20), which I hope will be enough. If not, we can either extend the attempts here or add support for a configurable number through the saas file (maybe even use a timeout).
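For context on that arithmetic: with a linear backoff that sleeps 1 second after the first failed attempt, 2 after the second, and so on, 20 attempts wait at most 1 + 2 + ... + 20 = 20 * 21 / 2 = 210 seconds. A self-contained sketch of such a decorator, assuming this backoff scheme (not necessarily the project's actual @retry implementation):

import functools
import time

def retry(max_attempts=20):
    # Linear backoff: sleep `attempt` seconds after the attempt-th failure,
    # so the worst-case total wait is max_attempts * (max_attempts + 1) / 2.
    def decorator(func):
        @functools.wraps(func)  # also exposes __wrapped__, used in tests later
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise
                    time.sleep(attempt)
        return wrapper
    return decorator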

@rporres (Contributor) commented Jun 10, 2021

> Do note that when we apply the new STS, it is not happy. This is why we need to delete the pods one by one. After each deletion, the new STS will kick in and create the pod (we are just deleting an orphaned one from the previous STS).

That's my understanding too of what you're trying to achieve here.

@maorfr (Contributor, Author) commented Jun 10, 2021

@apahim Required test coverage of 19.0% reached. Total coverage: 19.14%

@rporres (Contributor) commented Jun 10, 2021

> About pods that take a long time to be deleted: this is why I use 20 as the max attempts for the pod ready validation. 20 attempts is 210 seconds of backoff (1 + 2 + ... + 20), which I hope will be enough. If not, we can either extend the attempts here or add support for a configurable number through the saas file (maybe even use a timeout).

Thanos Receive pods have terminationGracePeriodSeconds set to 900 seconds, so our retry time has to take that into account, or we will declare pods not ready while they're behaving as configured.

I don't think the retry time should be configurable, as it would mean duplicating configuration that the pods already have.

What I don't know yet is how to avoid waiting the full terminationGracePeriodSeconds when the pod has already terminated properly, so that we don't wait longer than we need 🤔

This entry is concise and helpful about the mechanism:

https://cloud.google.com/blog/products/containers-kubernetes/kubernetes-best-practices-terminating-with-grace
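If the wait ever does need to respect a pod's own configuration, one option (purely a sketch, not part of this PR; the oc.get call shape is an assumption) is to derive the wait budget from terminationGracePeriodSeconds in the pod spec:

DEFAULT_GRACE_SECONDS = 30  # Kubernetes default when the field is unset

def pod_wait_budget(oc, namespace, pod_name, readiness_margin=60):
    # Read the pod's own terminationGracePeriodSeconds (900s for Thanos
    # Receive in the example above) and add a margin for the replacement
    # pod to start and pass its readiness probe.
    pod = oc.get(namespace, 'Pod', pod_name)
    grace = pod['spec'].get('terminationGracePeriodSeconds',
                            DEFAULT_GRACE_SECONDS)
    return grace + readiness_margin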

@apahim (Contributor) commented Jun 10, 2021

> @apahim Required test coverage of 19.0% reached. Total coverage: 19.14%

Oh, so we have some fat to burn, got it. Thanks!

@maorfr (Contributor, Author) commented Jun 10, 2021

@rporres I think I gave you a wrong answer. The waiting time is for a new pod to become ready. Waiting for the pod to be deleted is part of the oc delete step, and I believe we get that for free; oc delete itself will wait for the pod to terminate before returning.

@maorfr (Contributor, Author) commented Jun 10, 2021

Changed the approach a bit to simplify and avoid having to decide between cascading or not.

@rporres (Contributor) commented Jun 10, 2021

> Waiting for the pod to be deleted is part of the oc delete step, and I believe we get that for free

Yes, I also think you're right: oc waits for the pod to be deleted before returning, so I also believe we get that for free.

The other times we could take into account are the readiness/startup probe durations, but I would leave that for another iteration as it can be quite complicated to set up.

@rporres (Contributor) left a comment

Two nits; this looks good otherwise :)

self.validate_pod_ready(namespace, name)

@retry(max_attempts=20)
def validate_pod_ready(self, namespace, name):
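For readers without the diff open, a sketch of what such a readiness check might look like, inferred from the tests later in this thread; the real method in reconcile/utils/oc.py may differ:

class PodNotReadyError(Exception):
    pass

# (a method on the OC class)
@retry(max_attempts=20)
def validate_pod_ready(self, namespace, name):
    # Fetch the pod and require every container to report ready;
    # raising lets @retry back off and try again.
    pod = self.get(namespace, 'Pod', name)
    for status in pod['status']['containerStatuses']:
        if not status['ready']:
            raise PodNotReadyError(name)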
@Piojo (Contributor) commented on the diff

Nit: methods that check some status and force the program to halt if some condition isn't met are usually assertions. In English it reads more clearly if this is called assert_pod_ready(...).

@maorfr (Contributor, Author) replied

I have an old itch from my days as a test automation developer: assertions only happen in tests, and using "assert" in a function name suggests that an actual assertion is being used?

@Piojo (Contributor) commented Jun 14, 2021

Yeah, in my eyes this acts like one: it is called in the middle of some processing and breaks the flow entirely by raising an exception/aborting if some consistency condition isn't met. It is true that this one isn't meant to be disabled with -DNDEBUG.

It was a nit; I just like it when the naming makes it clear that this is an integrity check and that the normal flow of the program can break here.

@Piojo (Contributor) left a comment

By any reasonable standard, apply has become too long. We need to plan to split it.

Then, I am very uncomfortable with the fact that code reductions in other areas are being used as an excuse to not have a single test here. There is a lot of value for future iterations in testing at least get_owned_pods and validate_pod_ready. Simple tests (warning! untested!):

from unittest import TestCase
from unittest.mock import patch, MagicMock

import reconcile.utils.oc as oc

class TestGetOwnedPods(TestCase):
    @patch.object(oc.OC, 'get')
    @patch.object(oc.OC, 'get_obj_root_owner')
    def test_get_owned_pods(self, get_owner, oc_get):

        oc_get.return_value = {
            'items': ['pod1', 'pod2', 'pod3']
        }
        get_owner.side_effect = [
            {
                'kind': 'mykind',
                'metadata': {'name': 'myname'}
            },
            {
                'kind': 'notmykind',
                'metadata': {'name': 'notmyname'},
            },
            {
                'kind': 'mykind',
                'metadata': {'name': 'notmyname'}
            }
        ]

        # 'name' is a reserved MagicMock kwarg, so set it after construction
        resource = MagicMock(kind='mykind')
        resource.name = 'myname'

        o = oc.OC()
        pods = o.get_owned_pods('namespace', resource)
        self.assertEqual(pods, ['pod1'])

class TestValidatePodReady(TestCase):
    @patch.object(oc.OC, 'get')
    def test_validate_pod_ready_all_good(self, oc_get):
        oc_get.return_value = {
            'status': {
                'containerStatuses': [
                    {
                        'name': 'container1',
                        'ready': True,
                    },
                    {
                        'name': 'container2',
                        'ready': True
                    }
                ]
            }
        }
        o = oc.OC()
        # Bypass the retry decorator; __wrapped__ is the undecorated
        # function, so pass the instance explicitly
        oc.OC.validate_pod_ready.__wrapped__(o, 'namespace', 'podname')
        # Feel free to add assertions around the call to oc_get, but
        # they aren't important in this test

    @patch.object(oc.OC, 'get')
    def test_validate_pod_ready_one_missing(self, oc_get):
        oc_get.return_value = {
            'status': {
                'containerStatuses': [
                    {
                        'name': 'container1',
                        'ready': True,
                    },
                    {
                        'name': 'container2',
                        'ready': False
                    }
                ]
            }
        }

        o = oc.OC()
        with self.assertRaises(oc.PodNotReadyError):
            oc.OC.validate_pod_ready.__wrapped__(o, 'namespace', 'podname')

Simple stuff like this will give the next person opening this file (which may be you in 6 months) a lot of confidence in what they're doing. And it will also speed up your own development iterations massively.

Would you consider these 2 test cases for inclusion?

@maorfr (Contributor, Author) commented Jun 13, 2021

Added tests! Thanks for helping get them started!

About apply's length: can you say exactly what makes it too long?
It does one thing (oc apply) and it does it well (including catching errors and overcoming them); it has very few local variables and fewer than 4 levels of indentation. It does have a lot of arguments, but not too many?

I agree that this section is not pretty:

# skip if namespace does not exist (as it will soon)
# do not skip if this is a cluster scoped integration
if namespace != 'cluster' and not oc.project_exists(namespace):
    msg = f"[{cluster}/{namespace}] namespace does not exist (yet)."
    if wait_for_namespace:
        logging.info(msg + ' waiting...')
        wait_for_namespace_exists(oc, namespace)
    else:
        logging.warning(msg)
        return

But otherwise, how would you refactor it?
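For what it's worth, one possible extraction, sketched as a suggestion; the helper name _namespace_ready is hypothetical, and the surrounding apply signature is not shown:

import logging

def _namespace_ready(oc, cluster, namespace, wait_for_namespace):
    # 'cluster' is the pseudo-namespace used by cluster scoped integrations
    if namespace == 'cluster' or oc.project_exists(namespace):
        return True
    msg = f"[{cluster}/{namespace}] namespace does not exist (yet)."
    if wait_for_namespace:
        logging.info(msg + ' waiting...')
        wait_for_namespace_exists(oc, namespace)
        return True
    logging.warning(msg)
    return False

# in apply():
#     if not _namespace_ready(oc, cluster, namespace, wait_for_namespace):
#         return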

@maorfr (Contributor, Author) commented Jun 14, 2021

In any case, we can discuss this in Jira: https://issues.redhat.com/browse/APPSRE-3355

Proceeding to merge this.
