Use healthcheck cron job to determine cluster readiness #743
I'm not super familiar with client-go, but why doesn't bv1C.Jobs(namespace).Get(ctx, name, metav1.GetOptions{}) work here?
Sigh... It does. I just did it a dumb way because I too am not very familiar with client-go.
Wait! On second thought, it does need to be a list. I need the resourceVersion field of the list itself in order to initiate the watch later. This allows me to detect the creation and deletion of jobs matching my criteria since I performed the list.
Ah, I understand.
If you use a zero resourceVersion, do you even need the initial get/list?
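The list-then-watch pattern being discussed can be sketched with a toy model (plain Go with stand-in types, not the real client-go API): the List call returns both the current items and the resourceVersion of the list itself, and a Watch started from that version delivers exactly the changes that happened afterward, with nothing replayed and nothing missed.

```go
package main

import "fmt"

// Toy stand-ins for list/watch semantics; not the real client-go types.
type Event struct {
	Type string // "ADDED", "MODIFIED", "DELETED"
	Name string
	RV   int // resourceVersion, monotonically increasing
}

type Store struct{ history []Event }

// List returns the names currently present plus the resourceVersion
// of the list itself -- the point from which a Watch can resume.
func (s *Store) List() (names []string, rv int) {
	present := map[string]bool{}
	for _, e := range s.history {
		if e.Type == "DELETED" {
			delete(present, e.Name)
		} else {
			present[e.Name] = true
		}
		rv = e.RV
	}
	for n := range present {
		names = append(names, n)
	}
	return names, rv
}

// Watch replays only events newer than fromRV, so nothing that happened
// between the List and the Watch is missed, and nothing older is repeated.
func (s *Store) Watch(fromRV int) []Event {
	var out []Event
	for _, e := range s.history {
		if e.RV > fromRV {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	s := &Store{history: []Event{
		{"ADDED", "osd-cluster-ready", 1},
		{"MODIFIED", "osd-cluster-ready", 2},
	}}
	names, rv := s.List()
	fmt.Println(names, rv) // [osd-cluster-ready] 2

	// Jobs change after the list...
	s.history = append(s.history,
		Event{"MODIFIED", "osd-cluster-ready", 3},
		Event{"DELETED", "osd-cluster-ready", 4},
	)
	// ...and a watch started from the list's resourceVersion
	// sees exactly those changes, nothing earlier.
	for _, e := range s.Watch(rv) {
		fmt.Println(e.Type, e.Name, e.RV)
	}
}
```

(A zero/empty resourceVersion, as the follow-up question notes, tells the real API server to start from any convenient point, which is why the initial list becomes optional in that mode.)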
I think we need to handle "completed but failed" as well.
...But how we handle it is different before vs after openshift/configure-alertmanager-operator#143
Need to think through this some more.
I'll wait to hear your updated thoughts on how this should work before altering this logic.
Talked this through with @jharrington22 just now. We're going to rework the readiness side so the picture is clear. Specific to this case, I think for osde2e's purposes "completed but failed" will translate to "cluster not ready, ain't ever gonna be".
But I'm also pretty sure we're going to use prometheus to carry that state... which would mean substantial changes in this PR. Stay tuned -- I hope to have something written up by tomorrow.
Keep an eye on OSD-6646 starting here
"Completed but failed" should be treated as "cluster not ready, ain't ever gonna be". I'll post more info in a top-level comment.
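In Job status terms, "completed but failed" shows up as a Failed condition rather than Complete. A minimal sketch of that three-way check, with stand-in types rather than the real batchv1 structs:

```go
package main

import "fmt"

// Stand-in for batchv1.JobCondition (the real type lives in k8s.io/api/batch/v1).
type JobCondition struct {
	Type   string // "Complete" or "Failed"
	Status string // "True", "False", "Unknown"
}

// readiness maps a finished Job to the outcomes discussed above:
//   Complete=True -> cluster ready
//   Failed=True   -> cluster not ready and never will be (hard stop)
//   neither       -> Job still running; keep waiting
func readiness(conds []JobCondition) string {
	for _, c := range conds {
		if c.Status != "True" {
			continue
		}
		switch c.Type {
		case "Complete":
			return "ready"
		case "Failed":
			return "never-ready"
		}
	}
	return "pending"
}

func main() {
	fmt.Println(readiness([]JobCondition{{"Complete", "True"}})) // ready
	fmt.Println(readiness([]JobCondition{{"Failed", "True"}}))   // never-ready
	fmt.Println(readiness(nil))                                  // pending
}
```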
We should accommodate deletion of the Job and keep looping. As soon as openshift/configure-alertmanager-operator#143 lands, deletion will be a normal part of the flow (see here).
A bit more explanation: We'll delete the Job if it fails. This does not mean health checks failed -- it means something went horribly wrong under the covers, like we failed to talk to k8s or something.
If something went that wrong, I doubt that we'd want to proceed with testing, so I think that this behavior is probably okay.
Sorry, I wasn't very clear: a failing Job is not unusual early in the cluster's life. A common thing we used to see was prometheus not being up yet. Evictions can also be a cause. The point of having the Job owned by some kind of controller was so we could mitigate this kind of thing.
That said, we could keep quite a bit of that level of retry logic in the Job itself via backoffLimit/activeDeadlineSeconds rather than having the controller react to failure. That would allow both sides to consider a failed Job "fatal" in whatever sense is appropriate.
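For reference, both knobs sit on the Job spec itself. The values and image below are illustrative placeholders, not what osd-cluster-ready actually uses:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: osd-cluster-ready
spec:
  backoffLimit: 4              # retry the pod up to 4 times before marking the Job Failed
  activeDeadlineSeconds: 7200  # mark the Job Failed after 2h, regardless of retries left
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: healthcheck
        image: example/healthcheck:latest  # placeholder image
```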
Update on this: For now, a deletion should never happen (except manually). So please consider it an immediate failure (as if the cluster will never be ready) -- in other words, the logic is correct as written.
Ideally, please ping me and let me have a poke at the cluster if you do see this case.
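The policy agreed on here is small enough to state as code. A toy sketch (the Event type is a stand-in for the real watch event, not the apimachinery type):

```go
package main

import "fmt"

// Stand-in for a watch event on the healthcheck Job.
type Event struct {
	Type string // "ADDED", "MODIFIED", "DELETED"
}

// verdict applies the policy from the thread: a deleted healthcheck Job
// should never happen (except manually), so treat it as "cluster will
// never be ready"; any other event means keep watching.
func verdict(e Event) string {
	switch e.Type {
	case "DELETED":
		return "fail-now"
	default:
		return "keep-watching"
	}
}

func main() {
	fmt.Println(verdict(Event{"MODIFIED"})) // keep-watching
	fmt.Println(verdict(Event{"DELETED"}))  // fail-now
}
```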
Did you want to make this timeout configurable? It is in c-am-o FWIW. (This could be a separate PR.)
I thought about it, but I don't think we have a real use-case for configuring it. YAGNI?
IIRC there's a zillion other config knobs around this, including the clean count, success sleep, and failure sleep. But I'm sure you're right -- are the other knobs ever used?
[Later] Actually, the use case is for testing, e.g. if you want to make sure the timeout code path works properly without having to wait 2h or edit code. I made extensive use of this in osd-cluster-ready itself when it had a similar tunable.
@2uasimojo I've added a config knob for this and tested it.