
Event service integration test flake #115

Closed
richardcase opened this issue Oct 7, 2021 · 4 comments · Fixed by #154
Labels
area/testing Indicates an issue related to test
kind/bug Something isn't working
priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now.

Comments

@richardcase
Member

richardcase commented Oct 7, 2021

What happened:
The "integration" tests for the event service aren't very stable and often have to be re-run as we get an error:

=== RUN   TestEventService_Integration
    event_service_test.go:29: creating subscribers
    event_service_test.go:61: subscribers waiting for events
    event_service_test.go:58: finished publishing events
    event_service_test.go:94: rpc error: code = Canceled desc = context canceled
--- FAIL: TestEventService_Integration (0.01s)

What did you expect to happen:
I expect them to pass consistently.

Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]

Environment:

  • reignite version:
  • OS (e.g. from /etc/os-release):
@richardcase richardcase added the kind/bug and area/testing labels Oct 7, 2021
@richardcase richardcase added this to the v0.1.0 milestone Oct 7, 2021
@yitsushi
Contributor

yitsushi commented Oct 8, 2021

If I'm correct, it's fixed by #120.
I didn't see that we had an issue for it.

I think this one can be closed.

@richardcase
Member Author

We appear to still be having issues with the event service tests.

@yitsushi
Contributor

I did not see integration test failures for it. Can you link a failed build, please?

@richardcase
Member Author

Here's one:
https://github.com/weaveworks/reignite/runs/3884860022?check_suite_focus=true#step:7:176

@richardcase richardcase added the priority/critical-urgent label Oct 19, 2021
yitsushi added a commit to yitsushi/flintlock that referenced this issue Oct 20, 2021
I was not able to reproduce the issue on my machine (why would I?), so first
I did a bit of cleanup to reduce the nested definitions. It's a bit
easier to follow now.

Added extra logging around contexts, so we can tell which one throws the
`code = Canceled desc = context canceled` error.

Created a PR on my fork and restarted the test job 4 times; none of them
failed.

I assume the real fix is the break after `subscriber.cancel()`. I'm not 100%
convinced, but potentially, when we cancel the context and then check
`eventCh` or `eventErrCh` again, both of them may already be closed by the
time the next loop iteration starts. We are talking about very few CPU
cycles, so anything can happen.

Related to liquidmetal-dev#115

Intentionally not marking with `Fixes #`.
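To make the suspected race more concrete, here is a minimal, self-contained Go sketch of the receive-loop shape described in that commit message. The names `eventCh`, `eventErrCh` and the cancel call come from the commit message; everything else (the loop structure, the fake publisher, the expected count, the labeled break) is assumed for illustration and is not the actual flintlock test code.

```go
package main

import (
	"context"
	"fmt"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	eventCh := make(chan string)   // stands in for the subscriber's event channel
	eventErrCh := make(chan error) // stands in for the subscriber's error channel

	// Fake publisher standing in for the event service.
	go func() {
		for i := 0; i < 3; i++ {
			eventCh <- fmt.Sprintf("event-%d", i)
		}
	}()

	var received []string
	const expected = 3

recvLoop:
	for {
		select {
		case evt := <-eventCh:
			received = append(received, evt)
			if len(received) == expected {
				cancel()
				// Leaving the loop right after cancelling means we never
				// come back around and select on channels that may already
				// be closed once the context is cancelled, which is the
				// suspected source of the "context canceled" error.
				break recvLoop
			}
		case err := <-eventErrCh:
			fmt.Println("subscriber error:", err)
		case <-ctx.Done():
			break recvLoop
		}
	}

	fmt.Println("received:", received)
}
```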
yitsushi added a commit to yitsushi/flintlock that referenced this issue Oct 20, 2021
I was able to reproduce the issue with a helping hand from cgroups.
Limiting the test to one CPU with a heavy quota (0.01% of my CPU) revealed
the issue.

Note before I describe what happened:
Containerd does not send messages to subscribers retrospectively; a
subscriber only receives events from the point at which it subscribed.

The original test published N events and only after that created
subscribers. The connection between the test and containerd is much
slower than the execution of the test, so by the time containerd sends
out the events to all subscribers, they are already there to receive
them. That's why it works on my machine and that's why it can pass on
GitHub Actions sometimes.

However, on a slow machine with only one vCPU, the test and containerd
are racing for their share of the CPU. In this scenario, the events are
already dispatched before the subscribers are ready to receive them.

Solution:
Create subscribers first and then publish events.

Disclaimer:
There is a chance that all I wrote above is not entirely correct, but
that's how I understand it. It doesn't really matter if we only look at
the logic in the test: originally it was publish->subscribe, which is not
correct; we need subscribers in place before we publish events.

Fixes liquidmetal-dev#115
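The ordering problem described in this commit message can be illustrated with a tiny toy exchange that, like containerd's event service as described above, does not replay past events to new subscribers. The `exchange` type and its `Subscribe`/`Publish` methods are purely illustrative and are not flintlock or containerd APIs.

```go
package main

import "fmt"

// exchange is a toy, non-replaying broker: a subscriber only sees events
// published after it subscribed, mirroring the behaviour described above.
type exchange struct {
	subs []chan string
}

func (e *exchange) Subscribe() <-chan string {
	ch := make(chan string, 16)
	e.subs = append(e.subs, ch)
	return ch
}

func (e *exchange) Publish(evt string) {
	for _, ch := range e.subs {
		ch <- evt
	}
}

func main() {
	e := &exchange{}

	// Original test order: publish first, subscribe later.
	// Nobody is subscribed yet, so this event is simply lost.
	e.Publish("lost-event")
	sub := e.Subscribe()

	// Fixed test order: the subscriber exists before publishing,
	// so this event is delivered.
	e.Publish("seen-event")

	fmt.Println("events queued for subscriber:", len(sub)) // 1: only "seen-event"
}
```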
yitsushi added a commit that referenced this issue Oct 21, 2021
* Fix EventService test

Fixes #115