

feat: Add retry counter on failed reconciliation #255

Merged
merged 16 commits into liquidmetal-dev:main on Nov 19, 2021

Conversation

yitsushi
Contributor

@yitsushi yitsushi commented Nov 12, 2021

What this PR does / why we need it:

If reconciliation fails, increase the Retry counter, set the NotBefore
field to a future date, and reschedule a retry.

A Go routine handles the forced retry, because the system otherwise only
reconciles when an event tells it to, or on the fixed periodic Resync
(which is too slow for this).

Because we never tracked whether a MicroVM was able to boot, we just let
the reconciler check whether the process is there and react to the
result. If the MicroVM failed to boot, we still reported success on the
MicroVM start step, which is wrong and makes the failed state impossible
to track. As a solution, a step now has a Verify function that is called
after Do. If the result is false, the step is marked as failed. That way
we can start the MicroVM, wait a bit, and check whether it is still
running; if it is not, the start failed.

Which issue(s) this PR fixes:
Fixes #232
Fixes #178

Special notes for your reviewer:

Checklist:

  • squashed commits into logical changes
  • includes documentation
  • adds unit tests
  • adds or updates e2e tests

@yitsushi yitsushi added the kind/feature New feature or request label Nov 12, 2021
If one fails, we can still listen for new requests and reconcile VMs; if
they keep failing, the retry logic will handle it.
@yitsushi
Contributor Author

Did not add tests for reconcile.go; we have zero tests for it and I think writing the full test suite is out of scope here, so I filed issue #264

@yitsushi yitsushi marked this pull request as ready for review November 16, 2021 10:22
@codecov-commenter

codecov-commenter commented Nov 16, 2021

Codecov Report

Merging #255 (339ca11) into main (aa5916e) will increase coverage by 0.24%.
The diff coverage is 45.45%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #255      +/-   ##
==========================================
+ Coverage   40.29%   40.54%   +0.24%     
==========================================
  Files          46       46              
  Lines        2169     2242      +73     
==========================================
+ Hits          874      909      +35     
- Misses       1236     1272      +36     
- Partials       59       61       +2     
Impacted Files Coverage Δ
core/application/reconcile.go 0.00% <0.00%> (ø)
core/plans/microvm_create_update.go 57.53% <0.00%> (ø)
infrastructure/controllers/microvm_controller.go 70.00% <0.00%> (ø)
pkg/planner/actuator.go 83.33% <0.00%> (-2.88%) ⬇️
core/steps/event/publish.go 100.00% <100.00%> (ø)
core/steps/microvm/create.go 100.00% <100.00%> (ø)
core/steps/microvm/delete.go 100.00% <100.00%> (ø)
core/steps/microvm/start.go 100.00% <100.00%> (ø)
core/steps/network/interface_create.go 93.44% <100.00%> (+0.22%) ⬆️
core/steps/network/interface_delete.go 91.30% <100.00%> (+0.39%) ⬆️
... and 7 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9e0abc6...339ca11. Read the comment docs.

core/application/reconcile.go (review thread resolved, outdated)
core/application/reconcile.go (review thread resolved)
core/steps/microvm/start.go (review thread resolved, outdated)
@richardcase richardcase mentioned this pull request Nov 17, 2021
Member

@richardcase richardcase left a comment


Just a few nits... feel free to resolve them depending on what you think.

core/models/microvm.go (review thread resolved, outdated)
core/steps/microvm/start.go (review thread resolved, outdated)
pkg/planner/planner.go (review thread resolved, outdated)
@@ -44,25 +44,32 @@ func TestNewStartStep(t *testing.T) {
ctx := context.Background()
vm := testVMToStart()

-	step := microvm.NewStartStep(vm, microVMService)
+	step := microvm.NewStartStep(vm, microVMService, 1)
Member


can we var these 1s here? just so it is easy to know at a glance what this param is meant to be

@@ -59,6 +59,8 @@ type MicroVMStatus struct {
NetworkInterfaces NetworkInterfaceStatuses `json:"network_interfaces"`
// Retry is a counter about how many times we retried to reconcile.
Retry int `json:"retry"`
// NotBefore tells the system to do not reconsile until given timestamp.
Member


teeny nit: reconcile with a c

core/steps/microvm/start_test.go (review thread resolved, outdated)
core/application/reconcile.go (review thread resolved, outdated)
Comment on lines 99 to 101
logger.Info("Wait to emit update")
time.Sleep(sleepTime)
logger.Info("Emit update")
Member


do we want to be a bit more clear on these logs?

like "waiting to publish update event" or "waiting to reschedule for update"

Contributor Author


I don't think we even need these lines; I added them while I was working on it.

time.Sleep(sleepTime)
logger.Info("Emit update")

_ = a.ports.EventService.Publish(
Member


do we not care about logging this err? could imagine that being an annoying one to solve if it failed

Contributor Author


Yes, I ignored it intentionally. This go routine is a booster for the system: if it fails, the next resync will do the trick. But we can log it. Hopefully people don't get confused reading the log, because the message appears at "random" times since it's not in the main flow (it's an async go routine).

execCtx := portsctx.WithPorts(ctx, a.ports)

executionID, err := a.ports.IdentifierService.GenerateRandom()
if err != nil {
if scheduleErr := a.reschedule(ctx, localLogger, spec); scheduleErr != nil {
return fmt.Errorf("saving spec after plan failed: %w", scheduleErr)
Member


more relevant error message? (ditto line 150)

Member

@Callisto13 Callisto13 left a comment


nice work 🎉

@yitsushi yitsushi merged commit 6ec5c7d into liquidmetal-dev:main Nov 19, 2021
Labels
kind/feature New feature or request
Development

Successfully merging this pull request may close these issues.

Add retry counter on failed reconciliation
Flintlock should know when a mVM dies
4 participants