
Event service integration test flake #115

Closed
richardcase opened this issue Oct 7, 2021 · 4 comments · Fixed by #154
Labels
area/testing Indicates an issue related to test
kind/bug Something isn't working
priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now.

Comments

@richardcase
Member

richardcase commented Oct 7, 2021

What happened:
The "integration" tests for the event service aren't very stable and often have to be re-run as we get an error:

=== RUN   TestEventService_Integration
    event_service_test.go:29: creating subscribers
    event_service_test.go:61: subscribers waiting for events
    event_service_test.go:58: finished publishing events
    event_service_test.go:94: rpc error: code = Canceled desc = context canceled
--- FAIL: TestEventService_Integration (0.01s)

What did you expect to happen:
I expect them to pass consistently.

Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]

Environment:

  • reignite version:
  • OS (e.g. from /etc/os-release):
@richardcase richardcase added the kind/bug and area/testing labels Oct 7, 2021
@richardcase richardcase added this to the v0.1.0 milestone Oct 7, 2021
@yitsushi
Contributor

yitsushi commented Oct 8, 2021

If I'm correct, it's fixed by #120.
I didn't see that we had an issue for it.

I think this one can be closed.

@richardcase
Member Author

We appear to still be having issues with the event service tests.

@yitsushi
Contributor

I did not see integration test failures for it. Can you link a failed build, please?

@richardcase
Member Author

Here's one:
https://github.com/weaveworks/reignite/runs/3884860022?check_suite_focus=true#step:7:176

@richardcase richardcase added the priority/critical-urgent label Oct 19, 2021
yitsushi added a commit to yitsushi/flintlock that referenced this issue Oct 20, 2021
I was not able to reproduce the issue on my machine (why would I?), so first
I did a bit of cleanup to reduce the nested definitions. It's a bit
easier to follow now.

Added extra logging around contexts, so we can tell which one throws the
`code = Canceled desc = context canceled` error.

Created a PR on my fork and restarted the test job 4 times; none of them
failed.

I assume the real fix is the break after `subscriber.cancel()`. I'm not 100%
convinced, but potentially, when we cancel the context and then check
`eventCh` or `eventErrCh` again, both of them may already be closed by the
time the next loop iteration starts. We are talking about very few CPU
cycles, so anything can happen.

Related to liquidmetal-dev#115

Intentionally not marking with `Fixes #`.
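To make the suspected race more concrete, here is a minimal, self-contained Go sketch of the receive-loop shape described in that commit message. The names `eventCh`, `eventErrCh` and the cancel call come from the commit message; everything else (the loop structure, the fake publisher, the expected count, the labeled break) is assumed for illustration and is not the actual flintlock test code.

```go
package main

import (
	"context"
	"fmt"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	eventCh := make(chan string)   // stands in for the subscriber's event channel
	eventErrCh := make(chan error) // stands in for the subscriber's error channel

	// Fake publisher standing in for the event service.
	go func() {
		for i := 0; i < 3; i++ {
			eventCh <- fmt.Sprintf("event-%d", i)
		}
	}()

	var received []string
	const expected = 3

recvLoop:
	for {
		select {
		case evt := <-eventCh:
			received = append(received, evt)
			if len(received) == expected {
				cancel()
				// Leaving the loop right after cancelling means we never
				// come back around and select on channels that may already
				// be closed once the context is cancelled, which is the
				// suspected source of the "context canceled" error.
				break recvLoop
			}
		case err := <-eventErrCh:
			fmt.Println("subscriber error:", err)
		case <-ctx.Done():
			break recvLoop
		}
	}

	fmt.Println("received:", received)
}
```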
yitsushi added a commit to yitsushi/flintlock that referenced this issue Oct 20, 2021
I was able to reproduce the issue with a helping hand from cgroups.
Limiting the test to one CPU with a heavy quota (0.01% of my CPU) revealed
the issue.

Note before I describe what happened:
Containerd does not send messages to subscribers retrospectively; a
subscriber only receives events from the point at which it subscribed.

The original test published N events and only after that created
subscribers. The connection between the test and containerd is much
slower than the execution of the test, so by the time containerd sends
out the events to all subscribers, they are already there to receive
them. That's why it works on my machine and that's why it can pass on
GitHub Actions sometimes.

However, on a slow machine with only one vCPU, the test and containerd
are racing for their share of the CPU. In this scenario, the events are
already dispatched before the subscribers are ready to receive them.

Solution:
Create subscribers first and then publish events.

Disclaimer:
There is a chance that all I wrote above is not entirely correct, but
that's how I understand it. It doesn't really matter if we only look at
the logic in the test: originally it was publish->subscribe, which is not
correct; we need subscribers in place before we publish events.

Fixes liquidmetal-dev#115
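The ordering problem described in this commit message can be illustrated with a tiny toy exchange that, like containerd's event service as described above, does not replay past events to new subscribers. The `exchange` type and its `Subscribe`/`Publish` methods are purely illustrative and are not flintlock or containerd APIs.

```go
package main

import "fmt"

// exchange is a toy, non-replaying broker: a subscriber only sees events
// published after it subscribed, mirroring the behaviour described above.
type exchange struct {
	subs []chan string
}

func (e *exchange) Subscribe() <-chan string {
	ch := make(chan string, 16)
	e.subs = append(e.subs, ch)
	return ch
}

func (e *exchange) Publish(evt string) {
	for _, ch := range e.subs {
		ch <- evt
	}
}

func main() {
	e := &exchange{}

	// Original test order: publish first, subscribe later.
	// Nobody is subscribed yet, so this event is simply lost.
	e.Publish("lost-event")
	sub := e.Subscribe()

	// Fixed test order: the subscriber exists before publishing,
	// so this event is delivered.
	e.Publish("seen-event")

	fmt.Println("events queued for subscriber:", len(sub)) // 1: only "seen-event"
}
```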
yitsushi added a commit that referenced this issue Oct 21, 2021
* Fix EventService test

Fixes #115