[neighorch] VOQ encap index change handling #1729
@abdosi - could you please take a look? Thanks.
@ysmanman please take a look.
orchagent/neighorch.cpp
// Handle encap index change. SAI does not support change of encap index for
// existing neighbors. Remove the neighbor but do not erase from consumer sync
// buffer. The next iteration will add the neighbor back with new encap index
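The comment above describes a two-pass pattern. The sketch below models it under simplifying assumptions: a `std::map` stands in for the consumer's sync buffer, and a hypothetical `FakeSai` stands in for the SAI neighbor table. Neither is the real orchagent or SAI API. On the first pass a changed encap index causes a remove while the task stays buffered; on the next pass no entry is programmed, so the neighbor is added with the new index and the task is erased.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-in for the SAI neighbor table: key -> programmed encap index.
struct FakeSai {
    std::map<std::string, uint32_t> neighbors;
    bool create(const std::string &key, uint32_t encap) {
        return neighbors.emplace(key, encap).second;
    }
    bool remove(const std::string &key) {
        return neighbors.erase(key) == 1;
    }
};

// One pass over pending tasks (a simplified model of a consumer sync buffer).
// A task is erased only once the neighbor is programmed with the requested index.
void doTaskPass(std::map<std::string, uint32_t> &toSync, FakeSai &sai) {
    auto it = toSync.begin();
    while (it != toSync.end()) {
        const std::string &key = it->first;
        uint32_t encap = it->second;
        auto programmed = sai.neighbors.find(key);
        if (programmed != sai.neighbors.end() && programmed->second != encap) {
            // Encap index changed: remove now, keep the task for the next pass.
            sai.remove(key);
            ++it;
            continue;
        }
        if (programmed == sai.neighbors.end() && !sai.create(key, encap)) {
            ++it;  // creation failed: retry on a later pass
            continue;
        }
        it = toSync.erase(it);  // programmed with the requested encap index
    }
}
```

The cost of this design, raised in the review below, is that the neighbor is briefly absent between the two passes.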
Deleting the neigh and adding it back may churn routes resolving over that neigh, so those routes get deleted and added back. I guess encap index change is common. It should probably only be seen on syncd restart, as you mentioned, or on linecard reload? Ideally, an encap index change should be handled by SAI just like it handles a MAC change today.
Typo: I meant to say encap index change is NOT common.
I agree. SAI should handle an encap index change the same way it handles a MAC change, but current SAI does not support it. Until SAI supports it, we will have to live with this delete and re-add approach. Once SAI is enhanced to handle encap index changes, we can change this.
@vganesan-nokia can you create an issue in SONiC to track SAI not supporting the set operation?
Yes. Created issue sonic-net/sonic-buildimage#7879. Thanks
I was discussing this with @prsunny, and we think a better option is to call the SAI API with set and, if it fails (which it will), fall back to removing and re-adding the neighbor. That way, once the attribute is supported, we won't need any SONiC change going forward.
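The suggested order of operations can be sketched as follows. The sketch assumes a hypothetical `NeighborApi` whose `setEncapIndex` returns `false` when SAI rejects the attribute, as the thread reports current SAI does. These names are illustrative only, not the SAI or orchagent API.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical neighbor programming interface; not the real SAI API.
struct NeighborApi {
    bool setSupported = false;  // does the simulated SAI accept SET of the encap index?
    uint32_t encap = 0;         // currently programmed encap index
    bool present = true;        // whether the neighbor entry exists
    int removeCalls = 0;
    int createCalls = 0;

    bool setEncapIndex(uint32_t idx) {
        if (!setSupported) return false;  // models a SAI_STATUS_FAILURE return
        encap = idx;
        return true;
    }
    bool remove() { ++removeCalls; present = false; return true; }
    bool create(uint32_t idx) { ++createCalls; present = true; encap = idx; return true; }
};

// Try an in-place SET first; fall back to remove + re-create only if SET fails.
// Once SET is supported, the fallback path is simply never taken.
bool updateEncapIndex(NeighborApi &api, uint32_t newEncap) {
    if (api.setEncapIndex(newEncap)) {
        return true;
    }
    if (!api.remove()) {
        return false;  // leave the task pending for a retry
    }
    return api.create(newEncap);
}
```

The appeal of this shape is that SONiC code does not need to change when SAI gains SET support. It depends on a failed SET not being fatal, which is exactly the concern raised in the next comment.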
There is a problem with this approach. In current SAI (w.r.t. brcm SAI 5.0), when a set is done for the encap index attribute in the neighbor record, SAI's neighbor entry set attribute API returns SAI_STATUS_FAILURE. When SAI does not return SUCCESS, syncd signals orchagent to shut down.
Fixed. Since syncd no longer shuts down orchagent on SET failures, the changes are made as suggested.
@@ -1133,8 +1133,30 @@ void NeighOrch::doVoqSystemNeighTask(Consumer &consumer)
 }

 if (m_syncdNeighbors.find(neighbor_entry) == m_syncdNeighbors.end() ||
-    m_syncdNeighbors[neighbor_entry].mac != mac_address)
+    m_syncdNeighbors[neighbor_entry].mac != mac_address ||
Why do we check the MAC for remote neighbors? Isn't checking the encap index sufficient?
No. A MAC change does not change the encap index. If we only check the encap index, we won't process MAC changes.
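The following sketch illustrates that answer. It uses a hypothetical `NeighborRecord` that holds the cached MAC and encap index, standing in for the fields kept in `m_syncdNeighbors`. A MAC change keeps the same encap index, and a post-restart reallocation keeps the same MAC, so checking either field alone misses one of the two cases.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical cached state for a remote neighbor: MAC plus SAI-allocated encap index.
struct NeighborRecord {
    std::string mac;
    uint32_t encapIndex;
};

// True if the synced entry must be (re)programmed: new neighbor, MAC change, or encap change.
bool needsProgramming(const NeighborRecord *cached, const NeighborRecord &synced) {
    return cached == nullptr ||
           cached->mac != synced.mac ||
           cached->encapIndex != synced.encapIndex;
}
```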
orchagent/neighorch.cpp
// Neigh successfully deleted from SAI. Set STATE DB to signal removal of the entries from the kernel
m_stateSystemNeighTable->del(state_key); |
Why do we only do this when removeNeighbor succeeds? If removeNeighbor fails for some reason, the kernel entries are not removed, and traffic may still be forwarded with the wrong encap index. Should we always signal removal of the kernel entries when the encap index changes?
We do this to keep retries clean. Removing the kernel neigh without successfully removing the neigh from SAI is partial processing, which would create problems on retries unless additional logic is added.
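The ordering defended above can be sketched as follows. The sketch assumes a hypothetical `removeFromSai` callback and a `StateTable` stand-in for the table behind `m_stateSystemNeighTable`. The STATE_DB delete, which signals kernel cleanup, happens only after SAI removal succeeds. A failed attempt therefore leaves both layers untouched, and the retry starts from the same state.

```cpp
#include <cassert>
#include <functional>
#include <set>
#include <string>

// Hypothetical stand-in for the system neighbor STATE_DB table; deleting a key
// is what signals removal of the kernel neighbor entry.
struct StateTable {
    std::set<std::string> keys;
    void del(const std::string &key) { keys.erase(key); }
};

// Returns true when the task is complete; false means "keep it for a retry".
bool removeRemoteNeighbor(const std::string &key,
                          const std::function<bool(const std::string &)> &removeFromSai,
                          StateTable &stateTable) {
    if (!removeFromSai(key)) {
        // SAI removal failed: do not touch kernel state, so the retry
        // sees exactly the same situation as this attempt.
        return false;
    }
    // Neighbor deleted from SAI; now signal kernel cleanup via STATE_DB.
    stateTable.del(key);
    return true;
}
```

The trade-off is the one raised in the review: while SAI removal keeps failing, the kernel entry with the old encap index stays in place.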
@neethajohn - please review and approve.
as added comment.
b27a47d to 62e2c4e
Signed-off-by: vedganes <[email protected]>
Handle the encap index change by first trying a SET; if SAI fails the SET of the encap index attribute on the neighbor entry, remove and re-add the whole neighbor entry. Signed-off-by: vedganes <[email protected]>
62e2c4e to 481cf18
Signed-off-by: vedganes <[email protected]>
[flex-counters] Delay flex counters stats init for faster boot time (sonic-net/sonic-swss#1803)
[mirror] Detach session dst ip from route orch LPM calculation regardless of session status at session CONFIG DB removal (sonic-net/sonic-swss#1800)
[Dynamic Buffer Calc] Support dynamic buffer calculation on top of port auto negotiation (sonic-net/sonic-swss#1762)
[neighorch] VOQ encap index change handling (sonic-net/sonic-swss#1729)
[neighorch] Mac for voq neighbors in VS platforms (sonic-net/sonic-swss#1724)
[acl mirror action] Mirror session ref count fix at acl rule attachment (sonic-net/sonic-swss#1761)
What I did
For remote neighbors in VOQ systems, when the encap index changes (due to a syncd restart), the previously programmed remote neighbors must be rewritten with the new encap index. The current VOQ system neighbor handling in neighorch only checks for a change of MAC address, so a change in the encap index is ignored. This leaves the encap index in the neighbor records mismatched between the owner ASIC and the remote ASICs after syncd restarts such as config reload or failure recovery. As a result, packets that are forwarded across the fabric and egress on a different ASIC than their ingress ASIC fail to be forwarded.
The encap index previously allocated by SAI and synced to CHASSIS_APP_DB may change for the same neighbor when syncd is restarted after a config reload or during failure recovery. When syncd restarts, SAI may undergo a full re-initialization and lose the record of its previous encap index allocations. When the neighbors are reprogrammed, they may therefore not get the same encap indices they had before the restart.
This fix stores the allocated encap index in orchagent (neighorch) and, if the neighbor entry already exists, compares it with the received encap index. This encap index comparison is new to the VOQ chassis and is done in addition to the existing MAC change check. For any neighbor synced from CHASSIS_APP_DB, if the entry already exists and its encap index has changed, the current entry is removed and re-added. Because current SAI does not support changing the encap index (i.e., a set operation on the encap index attribute of a neighbor record), delete and re-add is used.
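The following end-to-end sketch models the scenario. Every name in it is hypothetical: `RemoteNeighCache` stands in for the state neighorch keeps on a remote ASIC, and the test uses a `std::map` in place of CHASSIS_APP_DB. The owner ASIC publishes a neighbor, a simulated syncd restart reallocates a different encap index, and the remote side compares the new value with its cached index. An encap change is modeled as the delete and re-add path described above. A MAC-only change is modeled as an in-place update, as the thread says SAI supports today.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical value synced through CHASSIS_APP_DB for one neighbor.
struct SyncedNeigh {
    std::string mac;
    uint32_t encapIndex;
};

enum class Action { None, Add, UpdateMac, Readd };

// Remote-ASIC side: caches what it programmed and decides how to react to an update.
class RemoteNeighCache {
public:
    Action onSync(const std::string &key, const SyncedNeigh &update) {
        auto it = programmed_.find(key);
        if (it == programmed_.end()) {
            programmed_.emplace(key, update);
            return Action::Add;
        }
        SyncedNeigh &cached = it->second;
        if (cached.encapIndex != update.encapIndex) {
            // Encap index changed (e.g. SAI reallocated it after a syncd restart):
            // modeled as remove + re-add, carrying any new MAC along.
            cached = update;
            return Action::Readd;
        }
        if (cached.mac != update.mac) {
            cached.mac = update.mac;  // MAC-only change: updated in place
            return Action::UpdateMac;
        }
        return Action::None;  // nothing changed
    }

    uint32_t encapOf(const std::string &key) const { return programmed_.at(key).encapIndex; }

private:
    std::map<std::string, SyncedNeigh> programmed_;
};
```

Storing the index on every add or re-add makes repeated syncs of an unchanged entry no-ops. Only a real reallocation triggers reprogramming.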
Why I did it
To fix traffic loss in the VOQ chassis after config reload. The problem is described in issue sonic-net/sonic-buildimage#7451.
How I verified it
Details if related