[openshift] add support for recycling StatefulSet #1627
Conversation
I don't see tests, yet CI seems happy. Is that expected? I saw some discussion about not reducing coverage, and this clearly reduces it. Are we in a grace period or something?
Doing this with StatefulSets scares the **** out of me, but this approach is cool. The new SS doesn't do anything as it already has the number of pods it needs. Great.
I would add a good chunk of documentation (basically what you have in the description)
Another thing I don't know how to fix is pods taking a long time to get deleted. I remember Telemeter pods taking a long time to get rebooted because they needed to wait a lot to upload something. If that's something that is in the Pod spec, at least we should wait that time before declaring the Pod not ready...
@rporres do note that when we apply the new STS, it is not happy. This is why we need to delete the pods one by one. After each deletion, the new STS will kick in and create the pod (we are just deleting an orphaned one from the previous STS). About pods that take a long time to be deleted - this is why I use …
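To make the mechanism described here concrete, below is a minimal sketch of the recycle flow (orphan-delete the old StatefulSet, apply the new one, then delete the orphaned pods one at a time and wait for each replacement to become ready). It uses the upstream kubernetes Python client and hypothetical function names purely for illustration; the PR itself goes through the repo's OC wrapper, so this is not the actual implementation.

import time

from kubernetes import client, config


def recycle_statefulset(namespace, old_sts_name, new_sts_manifest, label_selector):
    # Illustrative sketch only; names and the label-selector approach are assumptions.
    config.load_kube_config()
    apps = client.AppsV1Api()
    core = client.CoreV1Api()

    # 1. Delete the old StatefulSet but orphan its pods, so nothing goes down.
    apps.delete_namespaced_stateful_set(
        old_sts_name,
        namespace,
        body=client.V1DeleteOptions(propagation_policy="Orphan"),
    )

    # 2. Apply the new StatefulSet. It already sees the expected number of
    #    matching pods, so it does not touch them.
    apps.create_namespaced_stateful_set(namespace, body=new_sts_manifest)

    # 3. Delete the orphaned pods one by one; after each deletion the new
    #    StatefulSet recreates the pod, and we wait for it to become ready
    #    before moving on to the next one.
    pods = core.list_namespaced_pod(namespace, label_selector=label_selector)
    for pod in pods.items:
        name = pod.metadata.name
        core.delete_namespaced_pod(name, namespace)
        wait_for_pod_ready(core, namespace, name)


def wait_for_pod_ready(core, namespace, name, attempts=20, interval=5):
    # Poll containerStatuses until every container reports ready.
    for _ in range(attempts):
        try:
            pod = core.read_namespaced_pod(name, namespace)
            statuses = pod.status.container_statuses or []
            if statuses and all(cs.ready for cs in statuses):
                return
        except client.ApiException:
            pass  # the pod may briefly not exist while it is being recreated
        time.sleep(interval)
    raise RuntimeError(f"pod {name} did not become ready in time")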
That's my understanding too of what you're trying to achieve here.
@apahim
Thanos receive pods have … I don't think the retry time should be configurable, as it would mean duplicating configuration that the pods already have. What I don't know yet how to achieve is to avoid waiting the … This entry is concise and helpful about the mechanism.
Oh, so we have some fat to burn, got it. Thanks!
@rporres I think I gave you a wrong answer. The waiting time is for a new pod to become ready. Waiting for the pod to be deleted is part of the …
Changed the approach a bit to simplify and to avoid having to decide between cascading or not.
Yes, I also think you're right. The other times we could take into account are the readiness/startup probe ones, but I would leave that for another iteration as it can be quite complicated to set up.
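To make that idea concrete, here is a rough sketch of how a wait budget could be derived from what the Pod spec already declares (terminationGracePeriodSeconds plus the startup/readiness probe timings) instead of adding a new configurable retry time. Attribute names follow the kubernetes Python client's V1Pod model; the helper itself is hypothetical and not part of this PR.

def wait_budget_seconds(pod):
    # Hypothetical helper: how long it is reasonable to wait for a pod to be
    # deleted and become ready again, based only on what its spec declares.
    budget = pod.spec.termination_grace_period_seconds or 30
    for container in pod.spec.containers:
        for probe in (container.startup_probe, container.readiness_probe):
            if probe is None:
                continue
            budget += probe.initial_delay_seconds or 0
            # Kubernetes defaults: periodSeconds=10, failureThreshold=3.
            budget += (probe.period_seconds or 10) * (probe.failure_threshold or 3)
    return budget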
Two nits, this looks good otherwise :)
self.validate_pod_ready(namespace, name)

@retry(max_attempts=20)
def validate_pod_ready(self, namespace, name):
Nit: Methods that check some status and force the program to halt if it doesn't meet some conditions are usually assertions. In English it reads clearer if this is called assert_pod_ready(...).
I have an old itch from my days as a test automation developer that assertions only happen in tests, and that using "assert" in a function name indicates an assertion is being used?
Yeah, in my eyes this acts like one: it is called in the middle of some processing and breaks the flow entirely by raising an exception/aborting if some consistency condition isn't met. It is true that this one isn't meant to be disabled with -DNDEBUG.
It was a nit, just that I like it when the naming makes it clear that this is an integrity check and that normal flow of the program can break here.
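(For readers less familiar with the analogy: Python's bare assert statements are stripped when the interpreter runs with -O, much like assert() under -DNDEBUG, which is one practical reason production checks like this one use an explicit raise. A tiny illustration, unrelated to the PR's actual code:)

def check_ready_with_assert(statuses):
    # Disabled entirely under `python -O`, like assert() under -DNDEBUG.
    assert all(s['ready'] for s in statuses), 'pod not ready'


def check_ready_with_raise(statuses):
    # Always enforced, regardless of interpreter flags.
    if not all(s['ready'] for s in statuses):
        raise RuntimeError('pod not ready')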
By any reasonable standard, apply has become too long. We need to plan to split it.
Then, I am very uncomfortable with the fact that code reductions in other areas are being used as an excuse to not have a single test here. There is a lot of value for future iterations in testing at least get_owned_pods and validate_pod_ready. Simple tests (warning! untested!):
from unittest import TestCase
from unittest.mock import patch, MagicMock

import reconcile.utils.oc as oc


class TestGetOwnedPods(TestCase):
    @patch.object(oc.OC, 'get')
    @patch.object(oc.OC, 'get_obj_root_owner')
    def test_get_owned_pods(self, get_owner, oc_get):
        oc_get.return_value = {
            'items': ['pod1', 'pod2', 'pod3']
        }
        get_owner.side_effect = [
            {
                'kind': 'mykind',
                'metadata': {'name': 'myname'}
            },
            {
                'kind': 'notmykind',
                'metadata': {'name': 'notmyname'},
            },
            {
                'kind': 'mykind',
                'metadata': {'name': 'notmyname'}
            }
        ]
        # 'name' is special-cased by the Mock constructor, so set it afterwards
        resource = MagicMock(kind='mykind')
        resource.name = 'myname'
        o = oc.OC()
        pods = o.get_owned_pods('namespace', resource)
        self.assertEqual(pods, ['pod1'])


class TestValidatePodReady(TestCase):
    @patch.object(oc.OC, 'get')
    def test_validate_pod_ready_all_good(self, oc_get):
        oc_get.return_value = {
            'status': {
                'containerStatuses': [
                    {
                        'name': 'container1',
                        'ready': True,
                    },
                    {
                        'name': 'container2',
                        'ready': True
                    }
                ]
            }
        }
        o = oc.OC()
        # Bypass the retry stuff; __wrapped__ is the unbound function,
        # so the instance is passed explicitly
        o.validate_pod_ready.__wrapped__(o, 'namespace', 'podname')
        # Feel free to add assertions around the call to oc_get, but
        # they aren't important in this test

    @patch.object(oc.OC, 'get')
    def test_validate_pod_ready_one_missing(self, oc_get):
        oc_get.return_value = {
            'status': {
                'containerStatuses': [
                    {
                        'name': 'container1',
                        'ready': True,
                    },
                    {
                        'name': 'container2',
                        'ready': False
                    }
                ]
            }
        }
        o = oc.OC()
        with self.assertRaises(oc.PodNotReadyError):
            o.validate_pod_ready.__wrapped__(o, 'namespace', 'podname')
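One note on the __wrapped__ trick used in these tests: it only works if the retry decorator preserves the original function via functools.wraps, which stores the undecorated function under __wrapped__; and because that function is unbound, the instance has to be passed explicitly as the first argument. A minimal sketch of such a decorator follows, for illustration only; the repo's actual retry helper may be implemented differently.

import functools
import time


def retry(max_attempts=3, interval=1):
    # Retry a function a bounded number of times, re-raising the last error.
    def decorator(func):
        @functools.wraps(func)  # exposes func as wrapper.__wrapped__
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise
                    time.sleep(interval)
        return wrapper
    return decorator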
Simple stuff like this will give the next person opening this file (which may be you in 6 months) a lot of confidence in what they're doing. It will also speed up your own development iterations massively.
Would you consider these 2 test cases for inclusion?
Added tests! Thanks for helping get them started! About …, I agree that this section is not pretty: qontract-reconcile/reconcile/openshift_base.py, lines 216 to 225 in 4f9dcbd
But otherwise, how would you refactor it?
In any case, we can discuss this in Jira: https://issues.redhat.com/browse/APPSRE-3355. Proceeding to merge this.
Part of https://issues.redhat.com/browse/APPSRE-3351.
This PR adds support for applying any additional changes to StatefulSets without incurring any downtime.
When we detect a "fields immutable" error for a StatefulSet: