Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[neighorch] VOQ encap index change handling #1729

Merged
merged 3 commits into from
Jul 2, 2021

Conversation

vganesan-nokia
Copy link
Contributor

What I did

For the remote neighbors in the VOQ systems when there is change in the
encap index (due to syncd restart), the previously programmed remote
neighs should be re-written with the new changed encap index. The
current voq system neigh handling api in neighorch only checks for the
change of mac address. Because of this the change in the encap index is
ignored. This causes mismatch of encap index in the neighbor records in
owner asic and the remote asics after syncd restart situations like
config reload or failure recovery. This results in packet forwarding
failures for the packets forwarded across fabric and egressing in
different asics than ingress asics. The encap index that was previously
allocated by SAI and synced to the CHASSIS_APP_DB may change for the
same neighbor when syncd is restarted after config reload or any failure
recovery situations. When syncd restarts, the SAI may undergo full
re-initialization and may loose the memory of previous encap index
allocation for the neighbors. During reprogramming of the neighbors they
may not get the same encap index that were allocated before restart.

This fix is to store the allocated encap index in the orchagent
(neighorch) and compare with received encap index if the neighbor entry
already exits. This comparison of encap index is new to voq chassis.
This is done in addition to checking for mac change. For any received
(synced) neigh (received from CHASSIS_APP_DB) if neigh entry already
exists and if there is change in encap index, the current neigh entry
is removed and re-added. As of now, since current SAI does not support
change of encap index (i.e, set of operaton of encap index attribute for
neigh records) del and re-add is done.

Why I did it

To fix the traffic loss problem in voq chassis after config reload. The problem is described in issue sonic-net/sonic-buildimage#7451

How I verified it

Details if related

@anshuv-mfst
Copy link

@abdosi - could you please take a look, thanks.

@anshuv-mfst
Copy link

@ysmanman please take a look.

Comment on lines 1139 to 1238
// Handle encap index change. SAI does not support change of encap index for
// existing neighbors. Remove the neighbor but do not errase from consumer sync
// buffer. The next iteration will add the neighbor back with new encap index
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deleing neigh and adding it back may churn routers resolving over that neigh. So routes get deleted and added back. I guess encap index change is common. Probably should be seen in syncd restart, as you mentioned, or linecard reload? Ideally, encap index change should be handled by SAI just like how it handles MAC change today.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: i meant to say encap index change is NOT common.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. SAI should handle the encap index change similar to mac change. But current SAI does not support it. Till SAI supports it, we will have to live with this delete and re-add approach. Once SAI is enhanced for the encap index change handling, we can change this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vganesan-nokia can you create issue in SONiC for tracking SAI not supporting set operation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vganesan-nokia can you create issue in SONiC for tracking SAI not supporting set operation

Yes. Created issue sonic-net/sonic-buildimage#7879. Thanks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i was discussing with @prsunny and we think better option will be call SAI api with set and if it fails (which it will) then we should do remove/add neigh. That way once the attribute is supported we don't need any sonic change going forward.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a problem with this approach. In current SAI (w.r.t brcm SAI 5.0), when set is done for encap index attribute in the neighbor record, SAI's neighbor entry set attribute API returns SAI_STATUS_FAILURE. When there is not SUCCESS from SAI, syncd sends orchagent shut down.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Since syncd is not sending orchagent shutdown for SET failures anymore, changes are done as suggested.

@@ -1133,8 +1133,30 @@ void NeighOrch::doVoqSystemNeighTask(Consumer &consumer)
}

if (m_syncdNeighbors.find(neighbor_entry) == m_syncdNeighbors.end() ||
m_syncdNeighbors[neighbor_entry].mac != mac_address)
m_syncdNeighbors[neighbor_entry].mac != mac_address ||
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we check mac for remote neighbor? Isn't checking encap sufficient?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. For mac change, the encap index is not changed. If we only check encap index, we'll not be processing the mac change.

Comment on lines 1149 to 1247
//neigh successfully deleted from SAI. Set STATE DB to signal to remove entries from kernel
m_stateSystemNeighTable->del(state_key);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we only do this when removeNeighbor succeeds? If removeNeighbor fails for some reasons, kernel entries are not removed and then traffic may still be forwarded with wrong encap index. Should we signal removing kernel entries always when encap index changes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do this for neat re-trying. Removal of kernel neigh without successful removal of neigh from SAI is partial processing. This will create problems in retries without additional logic.

ysmanman
ysmanman previously approved these changes May 25, 2021
@anshuv-mfst
Copy link

@neethajohn - please review and approve.

Copy link
Contributor

@abdosi abdosi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as added comment.

vedganes added 2 commits June 29, 2021 13:27
Signed-off-by: vedganes <[email protected]>

For the remote neighbors in the VOQ systems when there is change in the
encap index (due to syncd restart), the previously programmed remote
neighs should be re-written with the new changed encap index. The
current voq system neigh handling api in neighorch only checks for the
change of mac address. Because of this the change in the encap index is
ignored. This causes mismatch of encap index in the neighbor records in
owner asic and the remote asics after syncd restart situations like
config reload or failure recovery. This results in packet forwarding
failures for the packets forwarded across fabric and egressing in
different asics than ingress asics. The encap index that was previously
allocated by SAI and synced to the CHASSIS_APP_DB may change for the
same neighbor when syncd is restarted after config reload or any failure
recovery situations. When syncd restarts, the SAI may undergo full
re-initialization and may loose the memory of previous encap index
allocation for the neighbors. During reprogramming of the neighbors they
may not get the same encap index that were allocated before restart.

This fix is to store the allocated encap index in the orchagent
(neighorch) and compare with received encap index if the neighbor entry
already exits. This comparison of encap index is new to voq chassis.
This is done in addition to checking for mac change. For any received
(synced) neigh (received from CHASSIS_APP_DB) if neigh entry already
exists and if there is change in encap  index, the current neigh entry
is removed and re-added. As of now, since current SAI does not support
change of encap index (i.e, set of operaton of encap index attribute for
neigh records) del and re-add is done.
Changes done to handle the encap index change by first try to SET and
if there is failure from SAI for the SET of encap index attribute in
neighbor entry, do remove and re-add the whole neighbor entry.

Signed-off-by: vedganes <[email protected]>
@abdosi abdosi merged commit 8a169fc into sonic-net:master Jul 2, 2021
qiluo-msft pushed a commit to sonic-net/sonic-buildimage that referenced this pull request Jul 5, 2021
[flex-counters] Delay flex counters stats init for faster boot time (sonic-net/sonic-swss#1803)
[mirror] Detach session dst ip from route orch LPM calculation regardless of session status at session CONFIG DB removal (sonic-net/sonic-swss#1800)
[Dynamic Buffer Calc] Support dynamic buffer calculation on top of port auto negotiation (sonic-net/sonic-swss#1762)
[neighorch] VOQ encap index change handling (sonic-net/sonic-swss#1729)
[neighorch] Mac for voq neighbors in VS platforms (sonic-net/sonic-swss#1724)
[acl mirror action] Mirror session ref count fix at acl rule attachment (sonic-net/sonic-swss#1761)
judyjoseph pushed a commit to sonic-net/sonic-buildimage that referenced this pull request Aug 7, 2021
[flex-counters] Delay flex counters stats init for faster boot time (sonic-net/sonic-swss#1803)
[mirror] Detach session dst ip from route orch LPM calculation regardless of session status at session CONFIG DB removal (sonic-net/sonic-swss#1800)
[Dynamic Buffer Calc] Support dynamic buffer calculation on top of port auto negotiation (sonic-net/sonic-swss#1762)
[neighorch] VOQ encap index change handling (sonic-net/sonic-swss#1729)
[neighorch] Mac for voq neighbors in VS platforms (sonic-net/sonic-swss#1724)
[acl mirror action] Mirror session ref count fix at acl rule attachment (sonic-net/sonic-swss#1761)
carl-nokia pushed a commit to carl-nokia/sonic-buildimage that referenced this pull request Aug 7, 2021
[flex-counters] Delay flex counters stats init for faster boot time (sonic-net/sonic-swss#1803)
[mirror] Detach session dst ip from route orch LPM calculation regardless of session status at session CONFIG DB removal (sonic-net/sonic-swss#1800)
[Dynamic Buffer Calc] Support dynamic buffer calculation on top of port auto negotiation (sonic-net/sonic-swss#1762)
[neighorch] VOQ encap index change handling (sonic-net/sonic-swss#1729)
[neighorch] Mac for voq neighbors in VS platforms (sonic-net/sonic-swss#1724)
[acl mirror action] Mirror session ref count fix at acl rule attachment (sonic-net/sonic-swss#1761)
judyjoseph pushed a commit that referenced this pull request Aug 10, 2021
For the remote neighbors in the VOQ systems when there is change in the
encap index (due to syncd restart), the previously programmed remote
neighs should be re-written with the new changed encap index. The
current voq system neigh handling api in neighorch only checks for the
change of mac address. Because of this the change in the encap index is
ignored.
raphaelt-nvidia pushed a commit to raphaelt-nvidia/sonic-swss that referenced this pull request Oct 5, 2021
For the remote neighbors in the VOQ systems when there is change in the
encap index (due to syncd restart), the previously programmed remote
neighs should be re-written with the new changed encap index. The
current voq system neigh handling api in neighorch only checks for the
change of mac address. Because of this the change in the encap index is
ignored.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants