Incorrect route for 0.0.0.0 in APPL_DB #6483

Closed
bingwang-ms opened this issue Jan 18, 2021 · 4 comments · Fixed by sonic-net/sonic-swss#1606

bingwang-ms commented Jan 18, 2021

Description
The test case `test_hash` extracts routes from APPL_DB with `redis-dump -d 0 -k 'ROUTE*' -y`.
Recently, `test_hash` was failing consistently with the following exception:

E                   "Traceback (most recent call last):", 
E                   "  File \"ptftests/hash_test.py\", line 264, in runTest", 
E                   "    self.check_hash(hash_key)", 
E                   "  File \"ptftests/hash_test.py\", line 83, in check_hash", 
E                   "    (matched_index, _) = self.check_ip_route(hash_key, in_port, dst_ip, exp_port_list)", 
E                   "  File \"ptftests/hash_test.py\", line 91, in check_ip_route", 
E                   "    (matched_index, received) = self.check_ipv4_route(hash_key, in_port, dst_port_list)", 
E                   "  File \"ptftests/hash_test.py\", line 161, in check_ipv4_route", 
E                   "    return verify_packet_any_port(self, masked_exp_pkt, dst_port_list)", 
E                   "  File \"/usr/lib/python2.7/dist-packages/ptf/testutils.py\", line 2538, in verify_packet_any_port", 
E                   "    % (ports, device_number, result.format()))", 
E                   "AssertionError: Did not receive expected packet on any of ports [] for device 0.", 

The exception was caused by an incorrect default route, which results in an empty `exp_port_list`.

"ROUTE_TABLE:0.0.0.0/0": {
    "expireat": 1610968205.649774, 
    "ttl": -0.001, 
    "type": "hash", 
    "value": {
      "ifname": "eth0", 
      "nexthop": "10.3.146.1"
    }
  }
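
To see why this entry empties `exp_port_list`: the test derives its expected egress ports from the route's `ifname` field, and the management interface eth0 maps to no front-panel PTF port. A minimal Python sketch of that lookup, assuming a hypothetical `PORT_INDEX_MAP` (the real `hash_test.py` builds its port list from the testbed topology):

```
# Illustrative sketch (not the actual hash_test.py code): map each
# front-panel ifname on the route to the PTF ports the test expects
# traffic on. The map below is hypothetical.
PORT_INDEX_MAP = {
    "PortChannel0001": [0, 1],
    "PortChannel0002": [2, 3],
    "PortChannel0003": [4, 5],
    "PortChannel0004": [6, 7],
}

def expected_ports(route_entry):
    """Return the PTF ports reachable via the route's interfaces."""
    ports = []
    for ifname in route_entry["ifname"].split(","):
        # eth0 is the management interface; it has no PTF port mapping,
        # so a default route rewritten to eth0 contributes nothing.
        ports.extend(PORT_INDEX_MAP.get(ifname, []))
    return ports

print(expected_ports({"ifname": "PortChannel0001,PortChannel0002"}))  # [0, 1, 2, 3]
print(expected_ports({"ifname": "eth0"}))  # [] -> "Did not receive expected packet on any of ports []"
```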

The most bizarre thing is that the route in the kernel is correct:

root@str2-7260cx3-acs-9:~# show ip route 0.0.0.0
Routing entry for 0.0.0.0/0
  Known via "bgp", distance 20, metric 0, best
  Last update 02:06:27 ago
  * 10.0.0.33, via PortChannel0001, weight 1
  * 10.0.0.35, via PortChannel0002, weight 1
  * 10.0.0.37, via PortChannel0003, weight 1
  * 10.0.0.39, via PortChannel0004, weight 1

It is still unclear whether this issue is caused by previously run test cases or triggered by some other operation.

Steps to reproduce the issue:
The issue is not easy to reproduce manually, but it failed consistently in regression runs.

  1. Run test_hash

Describe the results you received:
Test case failed.

Describe the results you expected:
Test case should pass.

Additional information you deem important (e.g. issue happens only occasionally):

**Output of `show version`:**
SONiC Software Version: SONiC.HEAD.260-1dcab4d1
Distribution: Debian 10.7
Kernel: 4.19.0-9-2-amd64
Build commit: 1dcab4d1
Build date: Wed Jan 13 14:30:11 UTC 2021
Built by: johnar@jenkins-worker-22

Platform: x86_64-arista_7260cx3_64
HwSKU: Arista-7260CX3-D108C8
ASIC: broadcom
ASIC Count: 1
Serial Number: JPE20271584
Uptime: 12:05:09 up 19 min,  1 user,  load average: 3.11, 2.55, 1.92

Docker images:
REPOSITORY                    TAG                 IMAGE ID            SIZE
docker-syncd-brcm             HEAD.260-1dcab4d1   ca05b2873d11        640MB
docker-syncd-brcm             latest              ca05b2873d11        640MB
docker-snmp                   HEAD.260-1dcab4d1   31464fc4ec71        434MB
docker-snmp                   latest              31464fc4ec71        434MB
docker-teamd                  HEAD.260-1dcab4d1   636c9ea31de4        462MB
docker-teamd                  latest              636c9ea31de4        462MB
docker-nat                    HEAD.260-1dcab4d1   c5b3d6107182        465MB
docker-nat                    latest              c5b3d6107182        465MB
docker-router-advertiser      HEAD.260-1dcab4d1   96bb9ae5db0f        394MB
docker-router-advertiser      latest              96bb9ae5db0f        394MB
docker-platform-monitor       HEAD.260-1dcab4d1   cd0aa4b805a8        599MB
docker-platform-monitor       latest              cd0aa4b805a8        599MB
docker-lldp                   HEAD.260-1dcab4d1   f32e1e39f80a        434MB
docker-lldp                   latest              f32e1e39f80a        434MB
docker-dhcp-relay             HEAD.260-1dcab4d1   121828b00a2f        400MB
docker-dhcp-relay             latest              121828b00a2f        400MB
docker-sonic-mgmt-framework   HEAD.260-1dcab4d1   35d87d2d0014        607MB
docker-sonic-mgmt-framework   latest              35d87d2d0014        607MB
docker-orchagent              HEAD.260-1dcab4d1   b7ac345616a5        480MB
docker-orchagent              latest              b7ac345616a5        480MB
docker-macsec                 HEAD.260-1dcab4d1   dd50a44dd043        465MB
docker-macsec                 latest              dd50a44dd043        465MB
docker-sonic-telemetry        HEAD.260-1dcab4d1   01d17b325f2a        468MB
docker-sonic-telemetry        latest              01d17b325f2a        468MB
docker-fpm-frr                HEAD.260-1dcab4d1   7e22fd0f3300        479MB
docker-fpm-frr                latest              7e22fd0f3300        479MB
docker-sflow                  HEAD.260-1dcab4d1   5f969b191610        463MB
docker-sflow                  latest              5f969b191610        463MB
docker-database               HEAD.260-1dcab4d1   250a04c60781        393MB
docker-database               latest              250a04c60781        393MB   

bingwang-ms commented Jan 22, 2021

I figured out how to reproduce the issue: if we shut down an interface on the DUT, the default route in APPL_DB is lost.
Normally, the route should be:

admin@str-dx010-acs-4:~$ redis-cli -n 0 hgetall 'ROUTE_TABLE:0.0.0.0/0'
1) "nexthop"
2) "10.0.0.57,10.0.0.59,10.0.0.61,10.0.0.63"
3) "ifname"
4) "PortChannel0001,PortChannel0002,PortChannel0003,PortChannel0004"

If we shut down an interface (say Ethernet40), the route is lost:

admin@str-dx010-acs-4:~$ sudo config interface shutdown Ethernet40
admin@str-dx010-acs-4:~$ redis-cli -n 0 hgetall 'ROUTE_TABLE:0.0.0.0/0'
1) "nexthop"
2) "10.64.246.1"
3) "ifname"
4) "eth0"

The logs are as follows:

Jan 22 07:02:59.950246 str-dx010-acs-4 INFO syncd#syncd: [0] SAI_API_PORT:_brcm_sai_link_event_cb:450 Port 58 link down event cause: LOCAL
Jan 22 07:02:59.950719 str-dx010-acs-4 NOTICE swss#orchagent: :- doTask: Get port state change notification id:1000000000010 status:2
Jan 22 07:02:59.950879 str-dx010-acs-4 NOTICE swss#orchagent: :- updatePortOperStatus: Port Ethernet40 oper state set from up to down
Jan 22 07:02:59.957473 str-dx010-acs-4 NOTICE swss#orchagent: :- setHostIntfsOperStatus: Set operation status DOWN to host interface Ethernet40
Jan 22 07:02:59.957473 str-dx010-acs-4 NOTICE swss#orchagent: :- flushFdbEntries: flush key: SAI_OBJECT_TYPE_FDB_FLUSH:oid:0x21000000000000, fields: 1
Jan 22 07:02:59.957473 str-dx010-acs-4 NOTICE swss#orchagent: :- recordFlushFdbEntries: flush key: SAI_OBJECT_TYPE_FDB_FLUSH:oid:0x21000000000000, fields: 1
Jan 22 07:02:59.958609 str-dx010-acs-4 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet40 admin:1 oper:0 addr:00:e0:ec:c2:af:7a ifindex:107 master:89
Jan 22 07:02:59.958834 str-dx010-acs-4 INFO kernel: [57136.763852] Bridge: port 9(Ethernet40) entered disabled state
Jan 22 07:02:59.959670 str-dx010-acs-4 NOTICE syncd#syncd: :- processFdbFlush: fdb flush succeeded, updating redis database
Jan 22 07:02:59.960828 str-dx010-acs-4 NOTICE syncd#syncd: :- processFlushEvent: received a flush port fdb event, portVid = oid:0x3a0000000005f5, bvId = oid:0x0
Jan 22 07:02:59.961054 str-dx010-acs-4 NOTICE syncd#syncd: :- processFlushEvent: pattern ASIC_STATE:SAI_OBJECT_TYPE_FDB_ENTRY:*, portStr oid:0x3a0000000005f5
Jan 22 07:02:59.965333 str-dx010-acs-4 NOTICE swss#orchagent: :- meta_sai_on_fdb_flush_event_consolidated: processing consolidated fdb flush event of type: SAI_FDB_ENTRY_TYPE_DYNAMIC
Jan 22 07:02:59.969774 str-dx010-acs-4 NOTICE swss#orchagent: :- meta_sai_on_fdb_flush_event_consolidated: fdb flush took 0.004404 sec
Jan 22 07:02:59.970008 str-dx010-acs-4 NOTICE swss#orchagent: :- meta_sai_on_fdb_flush_event_consolidated: processing consolidated fdb flush event of type: SAI_FDB_ENTRY_TYPE_STATIC
Jan 22 07:02:59.973805 str-dx010-acs-4 NOTICE swss#orchagent: :- meta_sai_on_fdb_flush_event_consolidated: fdb flush took 0.003965 sec

In our regression run, test_configurable_drop_counters shuts down an interface on the leaf-fanout switch; as a result, the default route in APPL_DB is lost and test_hash subsequently fails.
Is this expected?


lguohan commented Jan 22, 2021

Completely unexpected. I can reproduce it on the virtual DUT. It seems to affect only the default route. Also, if I bring the port back up, the default route is still missing.

lguohan added the P0 priority label on Jan 22, 2021

prsunny commented Jan 22, 2021

This seems to be a bug/behavior change in FRR/Zebra:

When any interface is shut down, zebra sends fpmsyncd a route update for eth0, and that causes the route change in APPL_DB. It looks like the routes in the kernel default table are being reinstalled.

Zebra log during Ethernet28 shutdown:

Jan 22 20:52:24.916767 str2-7050cx3-acs-02 DEBUG bgp#zebra[26]: MESSAGE: ZEBRA_INTERFACE_DOWN Ethernet28(0)
Jan 22 20:52:24.916859 str2-7050cx3-acs-02 DEBUG bgp#zebra[26]: MESSAGE: ZEBRA_INTERFACE_ADDRESS_DELETE fe80::c2d6:82ff:fe58:5c7c/64 on Ethernet28(0)
Jan 22 20:52:24.916901 str2-7050cx3-acs-02 DEBUG bgp#zebra[26]: rib_update: Scheduled VRF (ALL), event RIB_UPDATE_KERNEL
Jan 22 20:52:24.919486 str2-7050cx3-acs-02 DEBUG bgp#zebra[26]: rib_update_handle_vrf_all: Handling VRF (ALL) event RIB_UPDATE_KERNEL
Jan 22 20:52:24.919693 str2-7050cx3-acs-02 DEBUG bgp#zebra[26]: rib_update_table: IPv4 VRF default Table 254 event RIB_UPDATE_KERNEL
Jan 22 20:52:24.919837 str2-7050cx3-acs-02 DEBUG bgp#zebra[26]: rib_update_table: IPv6 VRF default Table 254 event RIB_UPDATE_KERNEL
Jan 22 20:52:24.919984 str2-7050cx3-acs-02 DEBUG bgp#zebra[26]: rib_update_table: IPv4 VRF default Table 254 event RIB_UPDATE_KERNEL
Jan 22 20:52:24.920195 str2-7050cx3-acs-02 DEBUG bgp#zebra[26]: rib_update_table: IPv4 VRF default Table 254 event RIB_UPDATE_KERNEL
Jan 22 20:52:24.920258 str2-7050cx3-acs-02 DEBUG bgp#zebra[26]: rib_update_table: IPv6 VRF default Table 254 event RIB_UPDATE_KERNEL
Jan 22 20:52:24.920952 str2-7050cx3-acs-02 DEBUG bgp#zebra[26]: rib_update_table: IPv6 VRF default Table 254 event RIB_UPDATE_KERNEL
Jan 22 20:52:24.921461 str2-7050cx3-acs-02 DEBUG bgp#zebra[26]: FPM: fe80::/64 triggering update to FPM - Reason: installing in kernel
Jan 22 20:52:24.921689 str2-7050cx3-acs-02 DEBUG bgp#zebra[26]: FPM: fe80::/64 triggering update to FPM - Reason: new route selected
Jan 22 20:52:24.921842 str2-7050cx3-acs-02 DEBUG bgp#zebra[26]: FPM: 0.0.0.0/0 triggering update to FPM - Reason: installing in kernel
Jan 22 20:52:24.922002 str2-7050cx3-acs-02 DEBUG bgp#zebra[26]: FPM: 10.3.146.0/23 triggering update to FPM - Reason: installing in kernel
Jan 22 20:52:24.922277 str2-7050cx3-acs-02 DEBUG bgp#zebra[26]: FPM: ::/0 triggering update to FPM - Reason: installing in kernel
Jan 22 20:52:24.922792 str2-7050cx3-acs-02 DEBUG bgp#zebra[26]: FPM: fc00:2::/64 triggering update to FPM - Reason: installing in kernel
Jan 22 20:52:24.923429 str2-7050cx3-acs-02 DEBUG bgp#zebra[26]: FPM: fd00::/80 triggering update to FPM - Reason: installing in kernel
Jan 22 20:52:24.926347 str2-7050cx3-acs-02 DEBUG bgp#zebra[26]: MESSAGE: ZEBRA_INTERFACE_ADDRESS_DELETE fe80::c2d6:82ff:fe58:5c7c/64 on Ethernet28(0)
Jan 22 20:52:24.928381 str2-7050cx3-acs-02 DEBUG bgp#zebra[26]: rib_update: Scheduled VRF (ALL), event RIB_UPDATE_KERNEL
Jan 22 20:52:24.928442 str2-7050cx3-acs-02 DEBUG bgp#zebra[26]: FPM: zfpm_netlink_encode_route : RTM_NEWROUTE fe80::/64, Proto: kernel, Metric: 0
Jan 22 20:52:24.928442 str2-7050cx3-acs-02 DEBUG bgp#zebra[26]: FPM:   Intf: 2, Gateway: <No address>, Recursive: no, Type: Directly connected, Encap type: none
Jan 22 20:52:24.928460 str2-7050cx3-acs-02 DEBUG bgp#zebra[26]: FPM: zfpm_netlink_encode_route : RTM_NEWROUTE 0.0.0.0/0, Proto: kernel, Metric: 201
Jan 22 20:52:24.928503 str2-7050cx3-acs-02 DEBUG bgp#zebra[26]: FPM:   Intf: 2, Gateway: 10.3.146.1, Recursive: no, Type: IPv4 nexthop with ifindex, Encap type: none
Jan 22 20:52:24.928613 str2-7050cx3-acs-02 DEBUG bgp#zebra[26]: FPM: zfpm_netlink_encode_route : RTM_NEWROUTE 10.3.146.0/23, Proto: kernel, Metric: 0
Jan 22 20:52:24.928664 str2-7050cx3-acs-02 DEBUG bgp#zebra[26]: FPM:   Intf: 2, Gateway: <No address>, Recursive: no, Type: Directly connected, Encap type: none
Jan 22 20:52:24.928716 str2-7050cx3-acs-02 DEBUG bgp#zebra[26]: FPM: zfpm_netlink_encode_route : RTM_NEWROUTE ::/0, Proto: kernel, Metric: 201
Jan 22 20:52:24.928816 str2-7050cx3-acs-02 DEBUG bgp#zebra[26]: FPM:   Intf: 2, Gateway: fc00:2::1, Recursive: no, Type: IPv6 nexthop with ifindex, Encap type: none
Jan 22 20:52:24.928869 str2-7050cx3-acs-02 DEBUG bgp#zebra[26]: FPM: zfpm_netlink_encode_route : RTM_NEWROUTE fc00:2::/64, Proto: kernel, Metric: 1024
Jan 22 20:52:24.928991 str2-7050cx3-acs-02 DEBUG bgp#zebra[26]: FPM:   Intf: 2, Gateway: <No address>, Recursive: no, Type: Directly connected, Encap type: none
Jan 22 20:52:24.929042 str2-7050cx3-acs-02 DEBUG bgp#zebra[26]: FPM: zfpm_netlink_encode_route : RTM_NEWROUTE fd00::/80, Proto: kernel, Metric: 1024
Jan 22 20:52:24.929093 str2-7050cx3-acs-02 DEBUG bgp#zebra[26]: FPM:   Intf: 3, Gateway: <No address>, Recursive: no, Type: Directly connected, Encap type: none

Upon receiving this event, fpmsyncd simply overwrites the default route in APPL_DB, causing the issue.

fpmsyncd logs during the interface shutdown:

Jan 22 20:52:24.927558 str2-7050cx3-acs-02 INFO bgp#fpmsyncd: :- isRawProcessing: Rx MsgType:24 Encap:0
Jan 22 20:52:24.927940 str2-7050cx3-acs-02 DEBUG bgp#fpmsyncd: :- onRouteMsg: RouteTable set msg: fe80::/64 :: eth0
Jan 22 20:52:24.927940 str2-7050cx3-acs-02 INFO bgp#fpmsyncd: :- isRawProcessing: Rx MsgType:24 Encap:0
Jan 22 20:52:24.927940 str2-7050cx3-acs-02 DEBUG bgp#fpmsyncd: :- onRouteMsg: RouteTable set msg: 0.0.0.0/0 10.3.146.1 eth0
Jan 22 20:52:24.927940 str2-7050cx3-acs-02 INFO bgp#fpmsyncd: :- isRawProcessing: Rx MsgType:24 Encap:0

lguohan added a commit to lguohan/sonic-buildimage that referenced this issue Jan 24, 2021
Check sonic-net#6483

add test to make sure default route change in eth0 does not
affect the default route in the default vrf

Signed-off-by: Guohan Lu <[email protected]>

lguohan commented Jan 24, 2021

This only reproduces when there is such a default route in the kernel's default table:

root@b263e2768225:/# ip route list table default
default via 172.17.0.1 dev eth0
2.2.2.2 via 172.17.0.1 dev eth0

lguohan pushed a commit to sonic-net/sonic-swss that referenced this issue Jan 25, 2021
Fix sonic-net/sonic-buildimage#6483

An FRR behavior change from 7.2 to 7.5 makes FRR update the default route to eth0 in interface up/down events. Skipping routes to eth0 or docker0 to avoid such behavior.
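
In essence, the fix drops any FPM route update whose interface is eth0 or docker0 before it is written to APPL_DB. A minimal sketch of that filtering rule in Python (fpmsyncd itself is C++, and the function and field names below are illustrative only):

```
# Illustrative sketch of the guard from sonic-swss#1606: skip FPM route
# updates that resolve over the management or docker bridge interfaces,
# so kernel-table churn cannot clobber the BGP-learned default route.
SKIP_IFNAMES = {"eth0", "docker0"}

def should_sync_route(prefix, ifnames):
    """Return True if the route belongs in the APPL_DB ROUTE_TABLE."""
    return not any(ifname in SKIP_IFNAMES for ifname in ifnames)

print(should_sync_route("0.0.0.0/0", ["eth0"]))             # False -> skipped
print(should_sync_route("0.0.0.0/0", ["PortChannel0001",
                                      "PortChannel0002"]))  # True  -> synced
```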