Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[202012] Fix arp_update script #8412

Merged
merged 2 commits into from
Aug 13, 2021
Merged

[202012] Fix arp_update script #8412

merged 2 commits into from
Aug 13, 2021

Conversation

vmorokhx
Copy link
Contributor

@vmorokhx vmorokhx commented Aug 10, 2021

Why I did it

Fix #7968

Issue is detected on SONiC.20201231.11
In test_static_route.py::test_static_route_ecmp static routes are configured, but neighbors are not resolved after config reload even after 10 minutes.
It looks like the arp_update script is starting to ping when Vlan1000 is not fully configured.
When issue is reproduced, stuck ping6 process is observed in swss container :

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         180  0.1  0.0   6296  1272 pts/0    S    17:03   0:03 ping6 -I Vlan1000 -n -q -i 0 -c 1 -W 0 ff02::1

And when arp_update script successfully resolves neighbors, we observe sleep 300 instead of ping process

How I did it

Add sleep to be sure that all interfaces in up state and can be reachable by ping command

How to verify it

Run it manually with -x option
and run test_static_route.py

route/test_static_route.py::test_static_route PASSED
route/test_static_route.py::test_static_route_ecmp PASSED
route/test_static_route.py::test_static_route_ipv6 PASSED
route/test_static_route.py::test_static_route_ecmp_ipv6 SKIPPED

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106

Description for the changelog

A picture of a cute animal (not mandatory but encouraged)

image

@lguohan
Copy link
Collaborator

lguohan commented Aug 11, 2021

is this 202012 issue, or this issue also exists on master? if on master, can you generate the pr on master and request for cherry-pick?

@lguohan
Copy link
Collaborator

lguohan commented Aug 11, 2021

@shi-su and @prsunny , can you take a look at this fix? I am not sure sleep 1 is good enough?

@lguohan lguohan requested a review from shi-su August 11, 2021 09:08
@vmorokhx
Copy link
Contributor Author

Issue is not reproduced on latest master. After each config reload sleep 300 process is always observed, and script successfully resolves neighbors.

@@ -25,6 +25,8 @@ while /bin/true; do
fi
done

sleep 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From your description, I feel the problem here is the ping6 command could get stuck. Adding a wait might not completely solve the problem since there could still be some cases (e.g., the wait time is not enough, some improper vlan config) where it may get stuck. Maybe we can do something like adding a few seconds timeout to ping6cmd so that the script would not hang with the command? @prsunny what do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with Shi, you can refer line 70 for adding timeout.

Copy link
Contributor Author

@vmorokhx vmorokhx Aug 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a one second timeout to ping6 command. If we increase the timeout it can lead to a problem when we have a lot of VLAN's on device. Because each ping to VLAN will be with timeout, now time that we spend on wait is: number of VLAN's * timeout (by now one second).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is OK since the timeout is not expected to wait every time. For example, if the VLANs come up in one second (as the sleep you added in the previous commit), only the first one or two would have the timeout.

shi-su
shi-su previously approved these changes Aug 12, 2021
@shi-su
Copy link
Contributor

shi-su commented Aug 12, 2021

I think this fix should be added to the master branch as well. Even if this is not yet observed on master yet, it would be more future-proof, and it would also help to prevent similar issues on other platforms.

@lguohan
Copy link
Collaborator

lguohan commented Aug 13, 2021

@vmorokhx thanks for you contribution. next time, can you open pr to master branch, the mark cherry-pick, then we can do the cherry-pick.

@lguohan lguohan changed the base branch from 202012 to master August 13, 2021 06:22
@lguohan lguohan dismissed shi-su’s stale review August 13, 2021 06:22

The base branch was changed.

@lguohan lguohan changed the base branch from master to 202012 August 13, 2021 06:24
@lguohan lguohan merged commit 754378f into sonic-net:202012 Aug 13, 2021
lguohan pushed a commit that referenced this pull request Aug 13, 2021
Fix #7968

Issue is detected on SONiC.20201231.11

In test_static_route.py::test_static_route_ecmp static routes are configured, but neighbors are not resolved after config reload even after 10 minutes.
It looks like the arp_update script is starting to ping when Vlan1000 is not fully configured.
When issue is reproduced, stuck ping6 process is observed in swss container :

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         180  0.1  0.0   6296  1272 pts/0    S    17:03   0:03 ping6 -I Vlan1000 -n -q -i 0 -c 1 -W 0 ff02::1
And when arp_update script successfully resolves neighbors, we observe sleep 300 instead of ping process
judyjoseph pushed a commit that referenced this pull request Aug 25, 2021
Fix #7968

Issue is detected on SONiC.20201231.11

In test_static_route.py::test_static_route_ecmp static routes are configured, but neighbors are not resolved after config reload even after 10 minutes.
It looks like the arp_update script is starting to ping when Vlan1000 is not fully configured.
When issue is reproduced, stuck ping6 process is observed in swss container :

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         180  0.1  0.0   6296  1272 pts/0    S    17:03   0:03 ping6 -I Vlan1000 -n -q -i 0 -c 1 -W 0 ff02::1
And when arp_update script successfully resolves neighbors, we observe sleep 300 instead of ping process
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[static route] No arp entries for nexthops in the static route after config reload
5 participants