[chassis-packet]: internal bfd sessions bringup delays during config reload/reboot #17180

Closed
anamehra opened this issue Nov 15, 2023 · 14 comments
Assignees
Labels
Triaged this issue has been triaged

Comments

@anamehra
Contributor

Description

On a packet chassis, LC-to-LC connectivity over the fabric uses iBGP sessions, and internal BFD over the fabric interfaces provides fault detection for those sessions. On a scale setup with 3 or more fabric cards, orchagent takes longer to process the bfd session-up notifications from SAI during config reload or reboot. The delay arises because the same notification queue carries both bfd notifications and bgp route-learning notifications. During bgp/swss docker start, bfd and bgp configuration is applied together. As soon as a few bfd sessions come up, iBGP sessions start establishing, which triggers a flood of route-learning notifications for orchagent. New bfd session-up notifications that SAI sends during this time are processed late.
On a scale setup with 5 FCs, we observe that orchagent may take up to 12 minutes from docker start to process all bfd session-up messages.
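
To make the queueing effect concrete, here is a minimal toy model of it. It is not orchagent or SAI code: the event names, the counts, and the `priority` mapping are illustrative assumptions, and all events are treated as already queued. It compares a single shared FIFO with a queue that serves bfd events first.

```python
import heapq
from collections import deque


def fifo_order(events):
    """Serve events in arrival order; return the step at which each bfd event is handled."""
    queue = deque(events)
    handled_at = []
    step = 0
    while queue:
        step += 1
        if queue.popleft() == "bfd":
            handled_at.append(step)
    return handled_at


def priority_order(events, priority):
    """Serve the lowest priority number first; arrival order breaks ties."""
    heap = [(priority[kind], seq, kind) for seq, kind in enumerate(events)]
    heapq.heapify(heap)
    handled_at = []
    step = 0
    while heap:
        step += 1
        _, _, kind = heapq.heappop(heap)
        if kind == "bfd":
            handled_at.append(step)
    return handled_at


# Hypothetical workload: two early bfd events, a burst of route events, three late bfd events.
events = ["bfd"] * 2 + ["route"] * 1000 + ["bfd"] * 3

print(fifo_order(events))                                # [1, 2, 1003, 1004, 1005]
print(priority_order(events, {"bfd": 0, "route": 1}))    # [1, 2, 3, 4, 5]
```

In the shared FIFO, the three late bfd events wait behind the whole route burst, which is the behavior described above at a much smaller scale.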

If bgp sessions are kept down during the first ~3 minutes of docker bring-up, bfd session-up messages are handled on time. If bgp is started after that, session bring-up and route learning proceed properly.
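
The observation above can be pictured as a simple gate on bgp start. The sketch below is only an illustration: `should_start_bgp` and its parameters are hypothetical names, the 180-second hold comes from the ~3 minutes mentioned above, and the early release once all expected bfd sessions are up is an added assumption, not something that was tested here.

```python
def should_start_bgp(bfd_up, bfd_expected, elapsed_s, hold_s=180):
    """Return True once bgp may be started.

    bgp is released when every expected bfd session is up (assumed early exit)
    or when the hold timer has expired, whichever happens first.
    """
    return bfd_up >= bfd_expected or elapsed_s >= hold_s


# 60 s after docker start, 12 of 40 sessions up: keep bgp down.
print(should_start_bgp(bfd_up=12, bfd_expected=40, elapsed_s=60))
# Hold timer expired: start bgp even though some sessions are still down.
print(should_start_bgp(bfd_up=30, bfd_expected=40, elapsed_s=180))
```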

This GitHub issue is to find and implement a better way of handling the bfd and bgp sessions on chassis-packet.

@anamehra
Contributor Author

Hi @rlhui , @abdosi , please assign this to me for now. Could we plan to discuss this in the upcoming chassis meeting? Thanks.

@vperumal , @rajendrat , for your visibility.

@rlhui
Contributor

rlhui commented Nov 15, 2023

Why do we need the bfd sessions to come up first, before bgp? We may not want that, right? bfd is for monitoring/resiliency and is not necessarily needed in normal cases, unlike BGP, which is critical.

@rlhui rlhui added the Triaged this issue has been triaged label Nov 15, 2023
@rlhui
Contributor

rlhui commented Nov 15, 2023

what functional issue are we seeing because of this delay?

@arlakshm
Contributor

A priority queue is already present in swss.

@arlakshm
Contributor

Dual ToR has a mechanism to delay bgp session bring-up using an FRR configuration.

@anamehra
Contributor Author

Dual ToR has a mechanism to delay bgp session bring-up using an FRR configuration.

Thanks @arlakshm , I will check on that.

@anamehra
Contributor Author

what functional issue are we seeing because of this delay?

Hi @rlhui , no functional impact has been observed as such, but the overall bring-up of all bgp paths gets delayed.

@anamehra
Contributor Author

anamehra commented Nov 29, 2023

Dual ToR has a mechanism to delay bgp session bring-up using an FRR configuration.

Hi @arlakshm , I tried the following config, but it did not help much:
bgp graceful-restart restart-time 240
bgp graceful-restart select-defer-time 45

@rlhui
Contributor

rlhui commented May 1, 2024

@arlakshm please include this in sonic-common-infra subgroup as one high priority problem to solve, thanks.

@liuh-80
Contributor

liuh-80 commented Aug 30, 2024

I created a PR to fix #19569.
Can someone verify whether it also fixes this issue?
sonic-net/sonic-swss#3269

The issue in orchagent is that a massive number of low-priority events may block high-priority events. More detail can be found in issue #19569.
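
A small toy model shows how a flood of low-priority events can hold back a high-priority one. This is not the orchagent event loop and makes no claim about what sonic-net/sonic-swss#3269 changes; the function name, the `batch` parameter, and the numbers are assumptions for illustration. The loop always prefers the bfd queue when it re-checks priorities, but it only re-checks after handling up to `batch` route events.

```python
from collections import deque


def bfd_handled_step(route_backlog, bfd_arrival_step, batch):
    """Return the step at which a single bfd event is handled.

    Assumes 1 <= bfd_arrival_step <= route_backlog. The bfd event is enqueued
    right after `bfd_arrival_step` route events have been processed.
    """
    routes = deque(range(route_backlog))
    bfd = deque()
    step = 0
    while True:
        if bfd:
            # The high-priority queue wins whenever the loop re-checks priorities.
            bfd.popleft()
            return step + 1
        if not routes:
            raise ValueError("bfd event did not arrive before the backlog drained")
        # Handle up to `batch` low-priority events before re-checking priorities.
        for _ in range(min(batch, len(routes))):
            routes.popleft()
            step += 1
            if step == bfd_arrival_step:
                bfd.append("bfd-up")


print(bfd_handled_step(1000, 10, batch=1000))  # 1001: waits for the whole backlog
print(bfd_handled_step(1000, 10, batch=128))   # 129: waits for one bounded batch
print(bfd_handled_step(1000, 10, batch=1))     # 11: handled right after arrival
```

The smaller the batch of low-priority work handled between priority checks, the sooner the bfd event is served, at the cost of more frequent checks.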

@anamehra
Contributor Author

I created a PR to fix #19569. Can someone verify whether it also fixes this issue? sonic-net/sonic-swss#3269

The issue in orchagent is that a massive number of low-priority events may block high-priority events. More detail can be found in issue #19569.

Thanks @liuh-80 , I am validating this fix.

@anamehra
Contributor Author

I created a PR to fix #19569. Can someone verify whether it also fixes this issue? sonic-net/sonic-swss#3269

The issue in orchagent is that a massive number of low-priority events may block high-priority events. More detail can be found in issue #19569.

This change LGTM. I see a lot of improvement in port and bfd notification handling.

@rlhui
Contributor

rlhui commented Nov 8, 2024

@anamehra is this issue gone or still there?

@anamehra
Contributor Author

fixed by sonic-net/sonic-swss#3328
