[chassis-packet]: internal bfd sessions bringup delays during config reload/reboot #17180

anamehra · 2023-11-15T06:41:00Z

Description

On a packet chassis, LC-to-LC connectivity over the fabric uses iBGP sessions, and internal BFD over the fabric interfaces provides fault detection for those sessions. On a scaled setup with 3 or more fabric cards, orchagent takes longer to process the BFD session-up notifications from SAI during config reload or reboot. The delay arises because the same notification queue carries both BFD notifications and BGP route-learning notifications. When the bgp and swss dockers start, BFD and BGP configuration is applied together. As soon as a few BFD sessions come up, iBGP sessions start establishing, which triggers a flood of route-learning notifications to orchagent. New BFD session-up notifications sent by SAI during this period are processed late.
On a scaled setup with 5 FCs, we observe that orchagent may take up to 12 minutes from docker start to process all BFD session-up messages.
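
The head-of-line blocking described above can be illustrated with a toy model. This is a minimal sketch and not orchagent code: the event names, the event counts, and the single `deque` are assumptions chosen only to show how a late BFD notification waits behind earlier route notifications in one shared FIFO.

```python
from collections import deque


def events_processed_before(queue, target):
    """Return how many events a FIFO consumer handles before reaching `target`."""
    for position, event in enumerate(queue):
        if event == target:
            return position
    raise ValueError(f"{target!r} is not in the queue")


# Hypothetical shared notification queue (names and counts are illustrative).
shared = deque()
shared.append(("bfd_up", 0))                          # an early BFD session comes up
shared.extend(("route", i) for i in range(10_000))    # iBGP establishes; routes flood in
shared.append(("bfd_up", 1))                          # a later BFD session-up arrives last

# Number of events the consumer must drain before it sees the second BFD event.
waiting = events_processed_before(shared, ("bfd_up", 1))
print(waiting)
```

In this model the second BFD session-up is handled only after every earlier event has been drained, so its latency grows with the size of the route flood.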

If the BGP sessions are kept down during the first ~3 minutes of docker bring-up, the BFD session-up messages are handled on time. If BGP is started after that, session bring-up and route learning proceed normally.
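
One way to picture the workaround above is a start-up gate that holds BGP back until the BFD sessions are up or a hold-off timer expires. The sketch below is purely illustrative: the function name, its parameters, and the session counts are hypothetical, and the 180-second default only mirrors the ~3 minutes mentioned above. It is not an existing SONiC or FRR mechanism.

```python
def should_start_bgp(bfd_up, bfd_expected, elapsed_s, max_wait_s=180.0):
    """Decide whether BGP may start: all BFD sessions are up, or the hold-off expired."""
    if bfd_up < 0 or bfd_expected < 0:
        raise ValueError("session counts must be non-negative")
    return bfd_up >= bfd_expected or elapsed_s >= max_wait_s


# Hypothetical snapshots taken while the dockers come up.
decisions = [
    should_start_bgp(bfd_up=4, bfd_expected=40, elapsed_s=30),    # keep waiting
    should_start_bgp(bfd_up=40, bfd_expected=40, elapsed_s=95),   # all sessions are up
    should_start_bgp(bfd_up=12, bfd_expected=40, elapsed_s=180),  # hold-off expired
]
print(decisions)
```

The timeout branch matters because BGP is the critical service: a gate like this should never hold it down indefinitely if some BFD sessions fail to come up.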

This issue tracks finding and implementing a better way to handle BFD and BGP session bring-up on chassis-packet.



anamehra · 2023-11-15T06:42:46Z

Hi @rlhui , @abdosi , please assign this to me for now. Could we plan to discuss this in the upcoming chassis meeting? Thanks

@vperumal , @rajendrat , for your visibility.

rlhui · 2023-11-15T18:22:16Z

Why do we need the BFD sessions to come up first, before BGP? We may not want that, since BFD is for monitoring/resiliency and is not necessarily needed in normal cases, unlike BGP, which is critical.

rlhui · 2023-11-15T18:28:31Z

What functional issue are we seeing because of this delay?

arlakshm · 2023-11-22T16:35:53Z

A priority queue is already present in swss.

arlakshm · 2023-11-22T16:38:54Z

Dual ToR has a mechanism to delay BGP session bring-up using an FRR configuration.

anamehra · 2023-11-22T17:20:07Z

> Dual ToR has a mechanism to delay BGP session bring-up using an FRR configuration.

Thanks @arlakshm , I will check on that.

anamehra · 2023-11-22T17:21:26Z

> What functional issue are we seeing because of this delay?

Hi @rlhui , no functional impact has been observed as such, but the overall bring-up of all BGP paths is delayed.

anamehra · 2023-11-29T18:05:43Z

> Dual ToR has a mechanism to delay BGP session bring-up using an FRR configuration.

Hi @arlakshm , I tried the following config, but it did not help much.
`bgp graceful-restart restart-time 240`
`bgp graceful-restart select-defer-time 45`

rlhui · 2024-05-01T18:00:40Z

@arlakshm please include this in sonic-common-infra subgroup as one high priority problem to solve, thanks.

liuh-80 · 2024-08-30T06:26:36Z

I created a PR to fix #19569.
Can someone verify that it also fixes this issue?
sonic-net/sonic-swss#3269

The issue in orchagent is that a large volume of low-priority events may block high-priority events. More detail can be found in issue #19569.
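
To make the low-priority/high-priority interaction concrete, here is a minimal sketch of priority-ordered event handling. It is a generic illustration built on Python's `heapq`, not the change made in sonic-net/sonic-swss#3269; the class name, the two priority levels, and the event strings are all hypothetical.

```python
import heapq
from itertools import count

HIGH, LOW = 0, 1  # hypothetical levels; the smaller number is served first


class PriorityEventQueue:
    """Toy queue: serves events by priority, FIFO within the same priority."""

    def __init__(self):
        self._heap = []
        self._seq = count()  # tie-breaker that preserves arrival order

    def push(self, priority, event):
        heapq.heappush(self._heap, (priority, next(self._seq), event))

    def pop(self):
        _, _, event = heapq.heappop(self._heap)
        return event

    def __len__(self):
        return len(self._heap)


q = PriorityEventQueue()
for i in range(1000):
    q.push(LOW, f"route-{i}")      # flood of low-priority route events
q.push(HIGH, "bfd-session-up")     # high-priority event arrives last

first = q.pop()    # the high-priority event is served ahead of the backlog
second = q.pop()   # then low-priority events resume in arrival order
print(first, second)
```

With a single FIFO the last event pushed would wait behind all 1000 route events; with priority ordering it is served first, while the route events keep their relative order.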

anamehra · 2024-08-30T06:29:08Z

> I created a PR to fix #19569. Can someone verify that it also fixes this issue? sonic-net/sonic-swss#3269
>
> The issue in orchagent is that a large volume of low-priority events may block high-priority events. More detail can be found in issue #19569.

Thanks @liuh-80 , I am validating this fix.

anamehra · 2024-08-30T19:56:58Z

> I created a PR to fix #19569. Can someone verify that it also fixes this issue? sonic-net/sonic-swss#3269
>
> The issue in orchagent is that a large volume of low-priority events may block high-priority events. More detail can be found in issue #19569.

This change LGTM. I see a lot of improvement in port and BFD notification handling.

rlhui · 2024-11-08T05:30:16Z

@anamehra is this issue gone or still there?

anamehra · 2024-11-20T18:46:33Z

Fixed by sonic-net/sonic-swss#3328

rlhui added this to SONiC Chassis Nov 15, 2023

rlhui assigned anamehra Nov 15, 2023

rlhui added the Triaged label Nov 15, 2023

judyjoseph mentioned this issue Jul 31, 2024

[202405] [Chassis]: Ports take too long to come up due to delayed port up notification processing by orchagent #19569

Closed

liuh-80 mentioned this issue Aug 30, 2024

Fix port up/bfd sessions bringup notification delay issue. sonic-net/sonic-swss#3269

Merged


anamehra closed this as completed Nov 20, 2024

github-project-automation bot moved this to Done in SONiC Chassis Nov 20, 2024
