NPL TestMultipleProtocols unit test failing #3847
Comments
I didn't see the same error after re-triggering the test. It's a little weird, because we didn't change the Linux port allocation code recently, and it should still assign the same nodeport to a new rule that has an existing podIP+podPort and a new protocol. I also think this nodeport consistency check could perhaps be removed once we support unified port allocation? @antoninbas
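For context on that expectation, here is a minimal sketch of the described allocation behavior; this is not Antrea's actual allocator, and the type and function names are made up for illustration. Allocations are keyed by podIP:podPort, so a rule that differs only in protocol reuses the node port already assigned to that podIP:podPort.

```go
package npl

import "fmt"

// portAllocator is an illustrative sketch, not Antrea's real allocator.
// It hands out node ports keyed by "podIP:podPort".
type portAllocator struct {
	next  int
	byKey map[string]int // "podIP:podPort" -> allocated node port
}

func newPortAllocator(start int) *portAllocator {
	return &portAllocator{next: start, byKey: make(map[string]int)}
}

// allocate returns the node port for podIP:podPort, reusing an existing
// allocation when one exists (e.g. when only the protocol is new).
func (a *portAllocator) allocate(podIP string, podPort int) int {
	key := fmt.Sprintf("%s:%d", podIP, podPort)
	if np, ok := a.byKey[key]; ok {
		return np
	}
	np := a.next
	a.next++
	a.byKey[key] = np
	return np
}
```

Under this model, adding a UDP rule for a podIP:podPort that already has a TCP rule would return the same node port, which is the consistency the test expects.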
I think the … Not sure if this is the root cause, or why it would cause TestMultipleProtocols to fail only occasionally.
I am taking a look at this and trying to reproduce.
I am also looking at this issue, and I suspect it is related to the order in which Pod2's label and Service2's label are updated (not sure about it). The current test code first updates Pod2's label, then Service2's label. When the NPL controller processes Pod2's update, if Service2 has not been updated yet, it should first add the TCP protocol (selected by Service1), then delete the UDP protocol (selected by the original Service2). But if Service2 has already been updated in the mock apiserver cache, the NPL controller should only add the UDP protocol. This might introduce a non-deterministic condition; not sure if the issue is caused by this.
Wenying and I just found a possible case that may cause this issue. Although the pod label was updated earlier (…), I am not sure why this issue was first observed only after we merged the recent NPL PR; maybe the longer independent annotation could change the execution time of some NPL controller workers. I will add a wait between the pod update and the service update to ensure the execution sequence (see the sketch below), then rerun the unit test for multiple rounds to see whether the same error occurs. @antoninbas
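As a concrete illustration of that ordering, below is a minimal sketch that waits for the Pod update to be reconciled before updating the Service, rather than sleeping for a fixed duration. It assumes a client-go fake clientset, a hypothetical NPL annotation key, and that the NPL controller under test is watching this clientset; names like `nplAnnotationKey` and `updatePodThenService` are illustrative, not Antrea's actual test code.

```go
package npl_test

import (
	"context"
	"testing"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes/fake"
)

// nplAnnotationKey is assumed for illustration; the real key lives in Antrea's NPL package.
const nplAnnotationKey = "nodeportlocal.antrea.io"

func updatePodThenService(t *testing.T) {
	ctx := context.Background()
	pod2 := &corev1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "pod2", Namespace: "default"}}
	client := fake.NewSimpleClientset(pod2)
	// The NPL controller under test is assumed to be started against `client`.

	// 1. Update pod2's label so it is selected by Service1 (TCP).
	pod2.Labels = map[string]string{"app": "svc1"}
	if _, err := client.CoreV1().Pods("default").Update(ctx, pod2, metav1.UpdateOptions{}); err != nil {
		t.Fatal(err)
	}

	// 2. Wait until the controller has reconciled the Pod update, i.e. the NPL
	//    annotation is present, instead of sleeping for a fixed duration.
	if err := wait.Poll(100*time.Millisecond, 5*time.Second, func() (bool, error) {
		p, err := client.CoreV1().Pods("default").Get(ctx, "pod2", metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		_, ok := p.Annotations[nplAnnotationKey]
		return ok, nil
	}); err != nil {
		t.Fatalf("Pod update was not processed in time: %v", err)
	}

	// 3. Only now update Service2's selector, so the controller never observes a
	//    window in which pod2 is unselected and its node port gets released.
	// ... update Service2 here ...
}
```

Polling for the annotation rather than sleeping keeps the ordering deterministic regardless of how long the controller workers take.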
@XinShuYang I was able to reproduce it in my fork with a simple workflow: https://github.com/antoninbas/antrea/blob/go-test/.github/workflows/go-test.yml. You may want to do something similar in your fork. It took 3 tries, and it seems to only happen with the …
I increased the log verbosity a bit for my test, and I saw the following logs:
Notice that there is a 3rd …
Yes, the change in timings has made this issue apparent; there was probably always a race.
If your explanation is correct, I guess it is an issue with the test case itself, and not with the NPL implementation. Even though it's surprising behavior, if there is a (short) window of time during which the Pod is no longer used for NPL, it makes sense for the port assignment to be released.
issue antrea-io#3847 Ensure the execution sequence to avoid an unexpected annotation in a race condition. Signed-off-by: Shuyang Xin <[email protected]>
Thanks, I will also try to reproduce the error and collect logs with your setup.
In the failed case the annotation is: … After adding a waiting window, the test has not failed again.
The root cause is that, in a race condition, the Service update can be processed before the Pod label update. In this case, the pod2 annotation was first fully removed. Then the NPL controller added a new nodeport with both tcp and udp rules to pod2 on the Service update. As a result, the pod2 annotation was: '[{80 10.10.10.10 61002 tcp [tcp]} {80 10.10.10.10 61002 udp [udp]}]'. issue antrea-io#3847 The fix ensures the execution sequence to avoid an unexpected annotation in the race condition. Annotation after the fix: '[{80 10.10.10.10 61001 tcp [tcp]} {80 10.10.10.10 61001 udp [udp]}]' Signed-off-by: Shuyang Xin <[email protected]>
The failed case updated both the Service and the Pod. The issue happens when the Service update is processed by the NPL controller before the Pod update. In this case, the pod2 annotation was first fully removed. Then the NPL controller added a new nodeport with both tcp and udp rules to pod2 on the Service update. As a result, the pod2 annotation was: '[{80 10.10.10.10 61002 tcp [tcp]} {80 10.10.10.10 61002 udp [udp]}]'. issue antrea-io#3847 The fix ensures the execution sequence to avoid an unexpected annotation in the race condition. Annotation after the fix: '[{80 10.10.10.10 61001 tcp [tcp]} {80 10.10.10.10 61001 udp [udp]}]' Signed-off-by: Shuyang Xin <[email protected]>
The failed case updated both the Service and the Pod. The issue happens when the Service update is processed by the NPL controller before the Pod update. In this case, the pod2 annotation was first fully removed. Then the NPL controller added a new nodeport with both tcp and udp rules to pod2 on the Service update. As a result, the pod2 annotation was: '[{80 10.10.10.10 61002 tcp [tcp]} {80 10.10.10.10 61002 udp [udp]}]'. issue antrea-io#3847 The fix removes the unnecessary label update for pod2 to avoid an unexpected annotation in the race condition. Annotation after the fix: '[{80 10.10.10.10 61001 tcp [tcp]} {80 10.10.10.10 61001 udp [udp]}]' Signed-off-by: Shuyang Xin <[email protected]>
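As a small illustration of the post-fix expectation quoted above (both the tcp and udp rules reusing node port 61001), here is a hypothetical assertion helper. The rule struct mirrors the fields printed in the annotation but is an assumption, not Antrea's actual NPL annotation type.

```go
package npl_test

import "testing"

// nplRule mirrors the fields shown in the annotation above
// ({podPort podIP nodePort protocol [protocols]}); illustrative only.
type nplRule struct {
	PodPort   int
	PodIP     string
	NodePort  int
	Protocol  string
	Protocols []string
}

// assertSingleNodePort checks that every protocol rule for the Pod reuses the
// same node port, which is the consistency the fixed test expects.
func assertSingleNodePort(t *testing.T, rules []nplRule) {
	t.Helper()
	if len(rules) == 0 {
		t.Fatal("no NPL rules found in annotation")
	}
	want := rules[0].NodePort
	for _, r := range rules {
		if r.NodePort != want {
			t.Fatalf("expected all protocols to share node port %d, got %d for %s",
				want, r.NodePort, r.Protocol)
		}
	}
}
```

With the fix in place, this check would pass for '[{80 10.10.10.10 61001 tcp [tcp]} {80 10.10.10.10 61001 udp [udp]}]' and fail for the 61001/61002 mix seen in the failed runs.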
Describe the bug
https://github.com/antrea-io/antrea/actions/runs/2416940934
Versions:
main branch