Event service integration test flake #115
Labels
area/testing: Indicates an issue related to test
kind/bug: Something isn't working
priority/critical-urgent: Highest priority. Must be actively worked on as someone's top priority right now.
Comments
richardcase added the kind/bug and area/testing labels on Oct 7, 2021
If I'm correct, it's fixed by #120. I think this one can be closed.
We appear to still be having issues with the event service tests.
I did not see integration test failures for it. Can you link a failed build, please?
richardcase added the priority/critical-urgent label on Oct 19, 2021
yitsushi added a commit to yitsushi/flintlock that referenced this issue on Oct 20, 2021:
I was not able to reproduce the issue on my machine (why would I?), so first I did a bit of cleanup to reduce the nested definitions; it's a bit easier to follow now. I also added extra logging around contexts, so we can tell which one throws the `code = Canceled desc = context canceled` error. I created a PR on my fork and restarted the test job 4 times; none of the runs failed. I assume the real fix is to break after `subscriber.cancel()`. I'm not 100% convinced, but potentially, when we cancel the context and then check `eventCh` or `eventErrCh` again, both of them could theoretically already be closed by the time the next loop iteration starts. We are talking about very short CPU windows, so anything can happen. Related to liquidmetal-dev#115. Intentionally not marking with `Fixes #`.
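As a rough illustration of the receive-loop shape described above, here is a minimal Go sketch; the `subscriber` struct, the channel names, and the event payloads are assumptions made for the example, not the actual flintlock test code:

```go
package main

import (
	"context"
	"fmt"
)

// subscriber is a hypothetical stand-in for the test's subscription state:
// a cancel function plus the event and error channels the loop selects on.
type subscriber struct {
	cancel     context.CancelFunc
	eventCh    chan string
	eventErrCh chan error
}

// receive drains events until `want` have been seen, then cancels the
// subscription and returns immediately instead of looping again over
// channels that may already be closed.
func receive(ctx context.Context, sub subscriber, want int) error {
	received := 0
	for {
		select {
		case evt := <-sub.eventCh:
			fmt.Println("received:", evt)
			received++
			if received == want {
				sub.cancel()
				// Returning (or breaking) right here is the point of the
				// suspected fix: one more iteration after cancellation could
				// race with the closed context and surface
				// "code = Canceled desc = context canceled".
				return nil
			}
		case err := <-sub.eventErrCh:
			return err
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	sub := subscriber{
		cancel:     cancel,
		eventCh:    make(chan string, 1),
		eventErrCh: make(chan error, 1),
	}
	sub.eventCh <- "microvm-created"
	if err := receive(ctx, sub, 1); err != nil {
		fmt.Println("error:", err)
	}
}
```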
yitsushi added a commit to yitsushi/flintlock that referenced this issue on Oct 20, 2021:
I was able to reproduce the issue with a helping hand from cgroups. Limiting the test run to one CPU with a heavy quota (0.01% of my CPU) revealed the issue. A note before I describe what happened: containerd does not send messages to subscribers retrospectively; subscribers only receive events from the point at which they subscribed. The original test published N events and only after that created the subscribers. The connection between the test and containerd is much slower than the execution of the test, so by the time containerd wants to send out the events to all subscribers, they are already there to receive them. That's why it works on my machine and why it can sometimes pass on GitHub Actions. However, on a slow machine with only one vCPU, the test and containerd are racing for their own CPU share. In this scenario, the events are already published before the subscribers are ready to receive them. Solution: create the subscribers first and then publish the events. Disclaimer: there is a chance that everything I wrote above is not entirely correct, but that's how I understand it. It does not really matter much if we just look at the logic in the test: originally it was publish -> subscribe, which is not correct; we need subscribers before we publish events. Fixes liquidmetal-dev#115
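To make the ordering concrete, here is a minimal Go sketch of the subscribe-before-publish fix, using a hypothetical in-memory `fakeEvents` stand-in rather than the real flintlock or containerd event service:

```go
package main

import (
	"context"
	"fmt"
)

// fakeEvents mimics the relevant containerd behaviour: subscribers only see
// events published after Subscribe was called; nothing is delivered
// retrospectively.
type fakeEvents struct {
	subs []chan string
}

func (f *fakeEvents) Subscribe(ctx context.Context) <-chan string {
	ch := make(chan string, 16)
	f.subs = append(f.subs, ch)
	return ch
}

func (f *fakeEvents) Publish(ctx context.Context, event string) {
	for _, ch := range f.subs {
		ch <- event
	}
}

func main() {
	ctx := context.Background()
	svc := &fakeEvents{}

	// Subscribe first: anything published before this point is simply lost,
	// which is the race the flaky test hit on a slow single-vCPU runner.
	events := svc.Subscribe(ctx)

	// Only then publish the test events.
	published := []string{"microvm-created", "microvm-updated", "microvm-deleted"}
	for _, e := range published {
		svc.Publish(ctx, e)
	}

	// Every published event should now be received.
	for range published {
		fmt.Println("received:", <-events)
	}
}
```

Because the fake only delivers to channels that already exist when `Publish` is called, swapping the two steps back to publish-then-subscribe drops every event, which mirrors how the original test behaved on a slow runner.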
yitsushi added a commit that referenced this issue on Oct 21, 2021:
* Fix EventService test

I was able to reproduce the issue with a helping hand from cgroups. Limiting the test run to one CPU with a heavy quota (0.01% of my CPU) revealed the issue. A note before I describe what happened: containerd does not send messages to subscribers retrospectively; subscribers only receive events from the point at which they subscribed. The original test published N events and only after that created the subscribers. The connection between the test and containerd is much slower than the execution of the test, so by the time containerd wants to send out the events to all subscribers, they are already there to receive them. That's why it works on my machine and why it can sometimes pass on GitHub Actions. However, on a slow machine with only one vCPU, the test and containerd are racing for their own CPU share. In this scenario, the events are already published before the subscribers are ready to receive them. Solution: create the subscribers first and then publish the events. Disclaimer: there is a chance that everything I wrote above is not entirely correct, but that's how I understand it. It does not really matter much if we just look at the logic in the test: originally it was publish -> subscribe, which is not correct; we need subscribers before we publish events. Fixes #115
What happened:
The "integration" tests for the event service aren't very stable and often have to be re-run as we get an error:
What did you expect to happen:
I expect them to pass consistently.
Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]
Environment:
OS (e.g. from `/etc/os-release`):