feat: Add retry counter on failed reconciliation #255
If reconciliation fails, increase the Retry counter, set the NotBefore field to a future time, and reschedule a retry. A goroutine forces the retry, because the system otherwise reconciles only when an event tells it to or on the fixed periodic resync, which is too slow for this purpose. Until now we never tracked whether a MicroVM was able to boot; we just let the reconciler check whether the process was missing and react to the result. When a MicroVM failed to boot, the start step still reported success, which is wrong and makes the failed state impossible to track. As a solution, a step now has a Verify function that is called after Do. If it returns false, the step is marked as failed. That way we can start the MicroVM, wait a bit, and check whether it is still running; if it is not, the start failed.
Force-pushed from ae70166 to 90b7bed.
If one fails, we can still listen for new requests and reconcile VMs; if they always fail, the retry logic will handle this.
Force-pushed from 90b7bed to e48e1d7.
Did not add test on |
Force-pushed from 6ab67a2 to 7ec6979.
Codecov Report
@@ Coverage Diff @@
## main #255 +/- ##
==========================================
+ Coverage 40.29% 40.54% +0.24%
==========================================
Files 46 46
Lines 2169 2242 +73
==========================================
+ Hits 874 909 +35
- Misses 1236 1272 +36
- Partials 59 61 +2
This reverts commit 741f742.
Just a few nit...feel free to resolve them depending on what you think.
core/steps/microvm/start_test.go
Outdated
@@ -44,25 +44,32 @@ func TestNewStartStep(t *testing.T) {
ctx := context.Background()
vm := testVMToStart()

- step := microvm.NewStartStep(vm, microVMService)
+ step := microvm.NewStartStep(vm, microVMService, 1)
can we var these `1`s here? just so it is easy to know at a glance what this param is meant to be
core/models/microvm.go
Outdated
@@ -59,6 +59,8 @@ type MicroVMStatus struct {
NetworkInterfaces NetworkInterfaceStatuses `json:"network_interfaces"`
// Retry is a counter about how many times we retried to reconcile.
Retry int `json:"retry"`
// NotBefore tells the system to do not reconsile until given timestamp.
teeny nit: `reconcile` with a c
core/application/reconcile.go
Outdated
logger.Info("Wait to emit update")
time.Sleep(sleepTime)
logger.Info("Emit update")
do we want to be a bit more clear on these logs?
like "waiting to publish update event" or "waiting to reschedule for update"
I don't think we even need these lines, added them while i was working on it.
core/application/reconcile.go
Outdated
time.Sleep(sleepTime)
logger.Info("Emit update")

_ = a.ports.EventService.Publish(
do we not care about logging this err? could imagine that being an annoying one to solve if it failed
Yes, I ignored it intentionally. This goroutine is just a booster for the system; if it fails, the next resync will do the trick. But we can log it. Hopefully people don't get confused reading the log, because the message shows up at "random" times: it comes from an async goroutine, not from the main flow.
core/application/reconcile.go
Outdated
execCtx := portsctx.WithPorts(ctx, a.ports)

executionID, err := a.ports.IdentifierService.GenerateRandom()
if err != nil {
	if scheduleErr := a.reschedule(ctx, localLogger, spec); scheduleErr != nil {
		return fmt.Errorf("saving spec after plan failed: %w", scheduleErr)
more relevant error message? (ditto line 150)
Co-authored-by: Claudia <[email protected]>
nice work 🎉
Which issue(s) this PR fixes:
Fixes #232
Fixes #178