
Try to fix EventService test failures #154

Merged
2 commits merged into main from 115-event-service-fix
Oct 21, 2021

Conversation

@yitsushi (Contributor) commented Oct 20, 2021

I was able to reproduce the issue with a helping hand from cgroups.
Limiting the test to one CPU with a heavy quota (0.01% of my CPU) revealed
the issue.

Note before I describe what happened:
Containerd does not send messages to subscribers retrospectively; subscribers
only receive events published after the point at which they subscribed.

The original test published N events and only after that created the
subscribers. The connection between the test and containerd is much slower
than the execution of the test itself, so by the time containerd sends the
events out to all subscribers, they are usually already there to receive
them. That's why it works on my machine and why it can sometimes pass on
GitHub Actions.

However, on a slow machine with only one vCPU, the test and containerd race
for their share of CPU time. In that scenario the events are published before
the subscribers are ready to receive them.

Solution:
Create subscribers first and then publish events.
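
For illustration, a minimal sketch of the corrected ordering; the interface
and names below are hypothetical, not the actual flintlock test code:

```go
package events_test

import (
	"context"
	"testing"
	"time"
)

// Hypothetical subset of the event service under test, only to show the order.
type eventService interface {
	Publish(ctx context.Context, topic string, event interface{}) error
	Subscribe(ctx context.Context) (<-chan interface{}, <-chan error)
}

func testSubscribeThenPublish(t *testing.T, svc eventService, numEvents int) {
	ctx := context.Background()

	// 1. Create the subscriber first, so nothing is published before
	//    anyone is listening.
	evtCh, errCh := svc.Subscribe(ctx)

	// 2. Only then publish the events.
	for i := 0; i < numEvents; i++ {
		if err := svc.Publish(ctx, "/test/topic", i); err != nil {
			t.Fatalf("publish: %v", err)
		}
	}

	// 3. Every published event should now be delivered.
	for received := 0; received < numEvents; {
		select {
		case <-evtCh:
			received++
		case err := <-errCh:
			t.Fatalf("subscriber error: %v", err)
		case <-time.After(10 * time.Second):
			t.Fatalf("timed out after %d of %d events", received, numEvents)
		}
	}
}
```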

Disclaimer:
There is a chance that all I wrote above is not entirely correct, but that's
how I understand it. It does not matter much as long as the logic in the test
is right: originally it was publish->subscribe, which is not correct, we need
subscribers before we publish events.

Fixes #115

Checklist:

  • squashed commits into logical changes
  • includes documentation

@yitsushi yitsushi added kind/bug Something isn't working area/testing Indicates an issue related to test labels Oct 20, 2021
@yitsushi yitsushi marked this pull request as draft October 20, 2021 14:17
@yitsushi (Contributor, Author) commented Oct 20, 2021

It does not fix it XD it still crashes painfully after 2 minutes. Obviously I can't see this on my local machine.

@codecov-commenter commented Oct 20, 2021

Codecov Report

Merging #154 (e6ce829) into main (12b503c) will decrease coverage by 0.45%.
The diff coverage is 28.04%.

❗ Current head e6ce829 differs from pull request most recent head e645e2e. Consider uploading reports for the commit e645e2e to get more accurate results

@@            Coverage Diff             @@
##             main     #154      +/-   ##
==========================================
- Coverage   40.09%   39.64%   -0.46%     
==========================================
  Files          32       36       +4     
  Lines        1444     1695     +251     
==========================================
+ Hits          579      672      +93     
- Misses        789      926     +137     
- Partials       76       97      +21     
Impacted Files Coverage Δ
core/application/app.go 100.00% <ø> (ø)
core/application/errors.go 0.00% <0.00%> (ø)
core/application/reconcile.go 0.00% <0.00%> (ø)
core/plans/microvm_update.go 0.00% <0.00%> (ø)
infrastructure/containerd/convert.go 51.21% <0.00%> (ø)
infrastructure/containerd/errors.go 0.00% <0.00%> (ø)
infrastructure/containerd/lease.go 51.85% <0.00%> (-18.15%) ⬇️
infrastructure/firecracker/config.go 0.00% <0.00%> (ø)
infrastructure/firecracker/create.go 0.00% <0.00%> (ø)
infrastructure/firecracker/provider.go 0.00% <0.00%> (ø)
... and 12 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2941cce...e645e2e.

@yitsushi yitsushi marked this pull request as ready for review October 20, 2021 17:43
@yitsushi (Contributor, Author) commented Oct 20, 2021

Squash and better description: coming soon.

@yitsushi yitsushi force-pushed the 115-event-service-fix branch from 6f4201b to b70bad3 Compare October 20, 2021 18:06
@yitsushi (Contributor, Author)

Nope still not... why

@yitsushi yitsushi force-pushed the 115-event-service-fix branch from b70bad3 to 2d86efb Compare October 20, 2021 18:56
@yitsushi (Contributor, Author)

Now it seems it's happy.

@yitsushi yitsushi requested a review from richardcase October 20, 2021 19:07
@yitsushi yitsushi force-pushed the 115-event-service-fix branch from 2d86efb to b6f1379 Compare October 20, 2021 19:10
@yitsushi (Contributor, Author)

About the sleep: when we subscribe to events, establishing the connection to containerd can be slow (milliseconds). The WaitGroup only waits for the goroutine that sends out the subscribe request; it can't track whether the subscription is actually ready. That's why the small sleep is in there; 10 microseconds should be enough.
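
A rough sketch of that pattern (the subscriber interface and helper names are
made up for illustration, not the real test code):

```go
package events_test

import (
	"context"
	"sync"
	"time"
)

// Hypothetical subscriber interface, only to illustrate why the sleep exists.
type subscriber interface {
	Subscribe(ctx context.Context) (<-chan interface{}, <-chan error)
}

func subscribeAll(ctx context.Context, svc subscriber, n int) []<-chan interface{} {
	var wg sync.WaitGroup
	chans := make([]<-chan interface{}, n)

	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(idx int) {
			defer wg.Done()
			// Returning from Subscribe means the request went out, not that
			// containerd has finished registering this subscriber.
			chans[idx], _ = svc.Subscribe(ctx)
		}(i)
	}

	wg.Wait()                         // every subscribe request has been sent
	time.Sleep(10 * time.Microsecond) // give containerd a moment to register them

	return chans
}
```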

@richardcase (Member) left a comment


LGTM - just the question around assertion packages.


. "github.com/onsi/gomega"
"github.com/stretchr/testify/assert"
@richardcase (Member):


We don't need 2 assertion packages. If the consensus is testify/assert then I'm good with that. I like gomega, but not so much that I wouldn't be happy using something else.

@yitsushi (Contributor, Author):


I used it because it was already in there (no addition to go.mod).

@yitsushi (Contributor, Author):


My only issue is with the way gomega is used: I don't like the dot imports, because they can mess with the namespace. They are packages and should be used as packages, not as a pack of functions injected into our test module.
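
For example (a minimal sketch), gomega also works without the dot import by
going through NewWithT and package-qualified matchers:

```go
package events_test

import (
	"testing"

	"github.com/onsi/gomega"
)

func TestWithoutDotImport(t *testing.T) {
	g := gomega.NewWithT(t)

	result := 1 + 1
	// Matchers are referenced through the package name instead of being
	// injected into the test file's namespace by a dot import.
	g.Expect(result).To(gomega.Equal(2))
}
```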

@yitsushi (Contributor, Author) Oct 21, 2021


Updated the test to use only Gomega.

@yitsushi yitsushi merged commit ddca2ef into liquidmetal-dev:main Oct 21, 2021
Labels
area/testing Indicates an issue related to test kind/bug Something isn't working
Development

Successfully merging this pull request may close these issues.

Event service integration test flake
3 participants