-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[202012] Fix arp_update script #8412
Conversation
is this 202012 issue, or this issue also exists on master? if on master, can you generate the pr on master and request for cherry-pick? |
Issue is not reproduced on latest master. After each config reload |
files/scripts/arp_update
Outdated
@@ -25,6 +25,8 @@ while /bin/true; do | |||
fi | |||
done | |||
|
|||
sleep 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From your description, I feel the problem here is the ping6 command could get stuck. Adding a wait might not completely solve the problem since there could still be some cases (e.g., the wait time is not enough, some improper vlan config) where it may get stuck. Maybe we can do something like adding a few seconds timeout to ping6cmd
so that the script would not hang with the command? @prsunny what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with Shi, you can refer line 70 for adding timeout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a one second timeout to ping6 command. If we increase the timeout it can lead to a problem when we have a lot of VLAN's on device. Because each ping to VLAN will be with timeout, now time that we spend on wait is: number of VLAN's * timeout (by now one second).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is OK since the timeout is not expected to wait every time. For example, if the VLANs come up in one second (as the sleep you added in the previous commit), only the first one or two would have the timeout.
I think this fix should be added to the master branch as well. Even if this is not yet observed on master yet, it would be more future-proof, and it would also help to prevent similar issues on other platforms. |
@vmorokhx thanks for you contribution. next time, can you open pr to master branch, the mark cherry-pick, then we can do the cherry-pick. |
Fix #7968 Issue is detected on SONiC.20201231.11 In test_static_route.py::test_static_route_ecmp static routes are configured, but neighbors are not resolved after config reload even after 10 minutes. It looks like the arp_update script is starting to ping when Vlan1000 is not fully configured. When issue is reproduced, stuck ping6 process is observed in swss container : USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 180 0.1 0.0 6296 1272 pts/0 S 17:03 0:03 ping6 -I Vlan1000 -n -q -i 0 -c 1 -W 0 ff02::1 And when arp_update script successfully resolves neighbors, we observe sleep 300 instead of ping process
Fix #7968 Issue is detected on SONiC.20201231.11 In test_static_route.py::test_static_route_ecmp static routes are configured, but neighbors are not resolved after config reload even after 10 minutes. It looks like the arp_update script is starting to ping when Vlan1000 is not fully configured. When issue is reproduced, stuck ping6 process is observed in swss container : USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 180 0.1 0.0 6296 1272 pts/0 S 17:03 0:03 ping6 -I Vlan1000 -n -q -i 0 -c 1 -W 0 ff02::1 And when arp_update script successfully resolves neighbors, we observe sleep 300 instead of ping process
Why I did it
Fix #7968
Issue is detected on
SONiC.20201231.11
In
test_static_route.py::test_static_route_ecmp
static routes are configured, but neighbors are not resolved after config reload even after 10 minutes.It looks like the
arp_update
script is starting to ping when Vlan1000 is not fully configured.When issue is reproduced, stuck ping6 process is observed in swss container :
And when arp_update script successfully resolves neighbors, we observe
sleep 300
instead of ping processHow I did it
Add sleep to be sure that all interfaces in up state and can be reachable by ping command
How to verify it
Run it manually with -x option
and run
test_static_route.py
Which release branch to backport (provide reason below if selected)
Description for the changelog
A picture of a cute animal (not mandatory but encouraged)