[teamd]: wait for swss db flush done before starting teamd container #2626

Closed
wants to merge 2 commits into from

jipanyang
Collaborator

Signed-off-by: Jipan Yang [email protected]

- What I did
This is an attempt to fix #2606.

- How I did it

- How to verify it

- Description for the changelog

- A picture of a cute animal (not mandatory but encouraged)

@@ -97,6 +97,7 @@ start() {
/usr/bin/docker exec database redis-cli -n 2 FLUSHDB
/usr/bin/docker exec database redis-cli -n 5 FLUSHDB
clean_up_tables 6 "'PORT_TABLE*', 'MGMT_PORT_TABLE*', 'VLAN_TABLE*', 'VLAN_MEMBER_TABLE*', 'INTERFACE_TABLE*', 'MIRROR_SESSION*', 'VRF_TABLE*'"
/usr/bin/docker exec database redis-cli -n 0 SET "SWSS_DB_FLUSH_DONE" "1"
Contributor

@jleveque jleveque Mar 1, 2019

One problem I foresee here is that SWSS_DB_FLUSH_DONE never gets set to 0 or deleted. Take, for instance, the new logic I implemented in the 201803 branch (and will soon implement in the master/201811 branches as well). If a critical process crashes in the swss container, the container exits, causing the swss service to restart itself and its dependent services. At this point, swss restarts and subsequently restarts teamd. It's possible that teamd would check the value before the databases get flushed while SWSS_DB_FLUSH_DONE is still set to 1, causing teamd to start before the databases are flushed. How can we reliably set this value to 0 or delete the key when applicable? Maybe deleting the key when the swss service stops is enough?

Collaborator Author

Yes, at the very least SWSS_DB_FLUSH_DONE should be set to 0 when the swss service stops.
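
A minimal sketch of the flag lifecycle discussed in this review thread, assuming the flag is a plain string key in database 0 (which the `get` calls later in this conversation suggest). The function names and the overridable `REDIS_CLI` variable are hypothetical and are not taken from swss.sh.

```shell
#!/bin/bash
# Hypothetical sketch, not the actual swss.sh. REDIS_CLI is overridable
# (an assumption made here so the logic can be exercised without docker).
REDIS_CLI="${REDIS_CLI:-/usr/bin/docker exec database redis-cli}"
FLUSH_FLAG="SWSS_DB_FLUSH_DONE"

# Mark the flush as pending. Assumes the flag is a string key in database 0.
clear_flush_flag() {
    $REDIS_CLI -n 0 SET "$FLUSH_FLAG" "0" > /dev/null
}

# Mark the flush as complete.
set_flush_flag() {
    $REDIS_CLI -n 0 SET "$FLUSH_FLAG" "1" > /dev/null
}

# Flush the two databases shown in the diff above (table cleanup omitted).
flush_swss_dbs() {
    $REDIS_CLI -n 2 FLUSHDB > /dev/null
    $REDIS_CLI -n 5 FLUSHDB > /dev/null
}

# Start path: reset the flag first so a stale "1" is not left in place
# while the flush runs, then publish "1" once the flush has returned.
on_swss_start() {
    clear_flush_flag
    flush_swss_dbs
    set_flush_flag
}

# Stop path: leave the flag at "0" so dependents wait on the next start.
on_swss_stop() {
    clear_flush_flag
}
```

Clearing the flag in both the stop path and at the top of the start path is a belt-and-braces choice: if the stop hook is skipped (for example after a crash), the start path still resets the value before flushing.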

Contributor

Did you test this change with system warm reboot?

Collaborator Author

@yxieca I tested with cold reboot and teamd docker warm restart, but not system warm reboot. Did you see any issue?

Contributor

Not really.

In my test, I think your change is equivalent to ordering the teamd service after syncd. I don't see a behavior change from one to the other. Can you test that change?

diff --git a/files/build_templates/teamd.service.j2 b/files/build_templates/teamd.service.j2
index 792b824..bde55c6 100644
--- a/files/build_templates/teamd.service.j2
+++ b/files/build_templates/teamd.service.j2
@@ -2,6 +2,7 @@
Description=TEAMD container
Requires=updategraph.service
After=updategraph.service
+After=syncd.service
Before=ntp-config.service

[Service]

Collaborator Author

Yes, with the recent systemd service change, this should work too.

The difference might be in bootup speed (for both cold boot and warm boot), but I don't have any concrete data.
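
One way to confirm that the suggested `After=syncd.service` ordering is in effect on a running system is to inspect the unit's resolved `After` property. The sketch below is illustrative only: `unit_ordered_after` and the overridable `SYSTEMCTL` variable are hypothetical helpers, while `systemctl show -p After` is the standard systemd query.

```shell
#!/bin/bash
# Hypothetical helper. SYSTEMCTL is overridable (an assumption made here
# so the parsing can be exercised without systemd).
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

# Return 0 if unit $1 is ordered after unit $2, 1 otherwise.
# "systemctl show -p After <unit>" prints one line: After=a.service b.service ...
unit_ordered_after() {
    $SYSTEMCTL show -p After "$1" | tr ' =' '\n\n' | grep -qxF "$2"
}
```

For example, `unit_ordered_after teamd.service syncd.service && echo ordered` would print `ordered` only when the dependency is present.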

Collaborator Author

jipanyang commented Mar 1, 2019

Wait for swss to finish db flushing:

root@vlab-01:/tmp# ls -l
total 8
-rw-r--r-- 1 root  root    0 Mar  1 08:19 dump.rdb
-rw-r--r-- 1 root  root  824 Mar  1 08:19 swss-syncd-debug.log
-rw-r--r-- 1 root  root    0 Mar  1 08:19 swss-syncd-lock
-rw-r--r-- 1 admin admin 638 Mar  1 08:19 teamd_debug.log
root@vlab-01:/tmp# cat teamd_debug.log 
Fri Mar 1 08:19:24 UTC 2019 - Start waiting for swss
Fri Mar 1 08:19:28 UTC 2019 - Wait 1 second for SWSS_DB_FLUSH_DONE
Fri Mar 1 08:19:33 UTC 2019 - Wait 1 second for SWSS_DB_FLUSH_DONE
Fri Mar 1 08:19:35 UTC 2019 - Wait 1 second for SWSS_DB_FLUSH_DONE
Fri Mar 1 08:19:37 UTC 2019 - Wait 1 second for SWSS_DB_FLUSH_DONE
Fri Mar 1 08:19:38 UTC 2019 - Wait 1 second for SWSS_DB_FLUSH_DONE
Fri Mar 1 08:19:40 UTC 2019 - Wait 1 second for SWSS_DB_FLUSH_DONE
Fri Mar 1 08:19:42 UTC 2019 - Wait 1 second for SWSS_DB_FLUSH_DONE
Fri Mar 1 08:19:43 UTC 2019 - Wait 1 second for SWSS_DB_FLUSH_DONE
Fri Mar 1 08:19:45 UTC 2019 - Swss db flush done
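
The teamd-side wait that produces a log like the one above is not shown in this conversation. A rough sketch of such a polling loop, assuming the flag is a string key in database 0, might look like the following. `wait_for_swss_flush`, `REDIS_CLI`, `POLL_INTERVAL`, and `MAX_WAIT` are hypothetical names, and the timeout is an addition that the log above gives no evidence of.

```shell
#!/bin/bash
# Hypothetical sketch of the wait loop, not the PR's actual script.
REDIS_CLI="${REDIS_CLI:-/usr/bin/docker exec database redis-cli}"
POLL_INTERVAL="${POLL_INTERVAL:-1}"   # seconds between polls
MAX_WAIT="${MAX_WAIT:-60}"            # assumed upper bound, in seconds

# Poll SWSS_DB_FLUSH_DONE until it reads "1"; return 1 on timeout.
wait_for_swss_flush() {
    waited=0
    echo "$(date) - Start waiting for swss"
    while [ "$($REDIS_CLI -n 0 GET SWSS_DB_FLUSH_DONE)" != "1" ]; do
        if [ "$waited" -ge "$MAX_WAIT" ]; then
            echo "$(date) - Timed out waiting for SWSS_DB_FLUSH_DONE"
            return 1
        fi
        echo "$(date) - Wait 1 second for SWSS_DB_FLUSH_DONE"
        sleep "$POLL_INTERVAL"
        waited=$((waited + POLL_INTERVAL))
    done
    echo "$(date) - Swss db flush done"
}
```

A start script could call `wait_for_swss_flush || exit 1` before launching the container, so that a flag that never appears fails the start instead of blocking it indefinitely.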

The SWSS_DB_FLUSH_DONE flag is set and the LAG has the correct mtu in appDB:

root@vlab-01:/tmp# redis-cli -p 6379
127.0.0.1:6379> keys SWS*
1) "SWSS_DB_FLUSH_DONE"
127.0.0.1:6379> get SWSS_DB_FLUSH_DONE
"1"
127.0.0.1:6379> exit
root@vlab-01:/tmp# redis-cli -p 6379
127.0.0.1:6379> hgetall "LAG_TABLE:PortChannel0001"
1) "mtu"
2) "9100"
3) "admin_status"
4) "up"
5) "oper_status"
6) "up"

root@vlab-01:/home/admin# systemctl stop swss
then
systemctl start swss


127.0.0.1:6379> 
127.0.0.1:6379> get SWSS_DB_FLUSH_DONE
"0"
127.0.0.1:6379> get SWSS_DB_FLUSH_DONE
"1"
127.0.0.1:6379> 

The mtu is gone after the swss restart:

127.0.0.1:6379> hgetall "LAG_TABLE:PortChannel0001"
1) "admin_status"
2) "up"
3) "oper_status"
4) "up"
127.0.0.1:6379> 

root@vlab-01:/home/admin# systemctl restart teamd

The mtu is pushed back after the teamd restart:

127.0.0.1:6379> hgetall "LAG_TABLE:PortChannel0001"
1) "admin_status"
2) "up"
3) "oper_status"
4) "up"
5) "mtu"
6) "9100"
root@vlab-01:/home/admin# cat /tmp/teamd_debug.log 
Fri Mar 1 08:19:24 UTC 2019 - Start waiting for swss
Fri Mar 1 08:19:28 UTC 2019 - Wait 1 second for SWSS_DB_FLUSH_DONE
Fri Mar 1 08:19:33 UTC 2019 - Wait 1 second for SWSS_DB_FLUSH_DONE
Fri Mar 1 08:19:35 UTC 2019 - Wait 1 second for SWSS_DB_FLUSH_DONE
Fri Mar 1 08:19:37 UTC 2019 - Wait 1 second for SWSS_DB_FLUSH_DONE
Fri Mar 1 08:19:38 UTC 2019 - Wait 1 second for SWSS_DB_FLUSH_DONE
Fri Mar 1 08:19:40 UTC 2019 - Wait 1 second for SWSS_DB_FLUSH_DONE
Fri Mar 1 08:19:42 UTC 2019 - Wait 1 second for SWSS_DB_FLUSH_DONE
Fri Mar 1 08:19:43 UTC 2019 - Wait 1 second for SWSS_DB_FLUSH_DONE
Fri Mar 1 08:19:45 UTC 2019 - Swss db flush done
Fri Mar 1 08:36:05 UTC 2019 - Start waiting for swss
Fri Mar 1 08:36:06 UTC 2019 - Swss db flush done
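
The manual `hgetall` checks in this comment could be scripted roughly as follows. This is a sketch under the same assumptions as the session above (Redis on port 6379, `LAG_TABLE` in database 0); `lag_mtu` and the overridable `REDIS_CLI` variable are hypothetical names.

```shell
#!/bin/bash
# Hypothetical helper for the mtu check, not part of the PR.
REDIS_CLI="${REDIS_CLI:-redis-cli -p 6379}"

# Print the mtu stored for LAG $1 in LAG_TABLE; return 1 if it is missing.
lag_mtu() {
    mtu=$($REDIS_CLI -n 0 HGET "LAG_TABLE:$1" mtu)
    if [ -z "$mtu" ]; then
        echo "LAG_TABLE:$1 has no mtu field" >&2
        return 1
    fi
    echo "$mtu"
}
```

Running `lag_mtu PortChannel0001` after each restart step would make the missing-mtu state visible as a non-zero exit status.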

Contributor

jleveque commented Mar 8, 2019

Retest this please

Collaborator Author

Closing this in favor of #2724.

@jipanyang jipanyang closed this Apr 2, 2019
yxieca added a commit to yxieca/sonic-buildimage that referenced this pull request Feb 21, 2023
…nux-kernel] advance submodule head

linkmgrd:
* 3e7a9df 2023-02-19 | [active-active] Toggle to standby if default route is missing (sonic-net#171) (HEAD -> 202205) [Longxiang Lyu]
* 8ab1b2b 2023-02-15 | [active-active] fix issue that interfaces get stuck in `active` if service starts up with link state down (sonic-net#169) [Jing Zhang]
* df862ad 2023-02-11 | Fix mux config when gRPC connection is lost (sonic-net#166) [Longxiang Lyu]

utilities:
* 8aa7930c 2023-02-13 | [portstat CLI] don't print reminder if use json format (sonic-net#2670) (HEAD -> 202205, github/202205) [wenyiz2021]
* 4e3bb6fa 2023-02-21 | Add "show fabric reachability" command. (sonic-net#2672) [jfeng-arista]
* 3587a94b 2023-02-18 | [202205][dhcp_relay] Remove add field of vlanid to DHCP_RELAY table while adding vlan (sonic-net#2680) [Yaqiang Zhu]
* 4f07f7f0 2023-02-10 | Skip saidump for Spine Router as this can take more than 5 sec (sonic-net#2637) (sonic-net#2671) [kenneth-arista]
* e61c5ec4 2023-02-10 | [vlan] Refresh dhcpv6_relay config while adding/deleting a vlan (sonic-net#2660) (sonic-net#2669) [Yaqiang Zhu]

swss:
* 1bbf725 2023-02-14 | [Workaround] EvpnRemoteVnip2pOrch warmboot check failure (sonic-net#2626) (HEAD -> 202205) [jcaiMR]
* 380f72b 2023-02-20 | Support for tc-dot1p and tc-dscp qosmap (sonic-net#2559) [Divya Mukundan]
* dbf6fcc 2022-11-01 | Added LAG member check on addLagMember() (sonic-net#2464) [Andriy Kokhan]

swss-common:
* b31391b 2023-02-21 | Prevent sonic-db-cli generate core dump (sonic-net#749) (HEAD -> 202205) [Hua Liu]
* 16ff689 2022-12-13 | Support for TC-DOT1p qos map (sonic-net#721) [Divya Mukundan]

platform-daemons:
* fb92af4 2023-02-09 | [ycabled] add more coverage to ycabled; add minor name change for vendor API CLI return key-values pairs (sonic-net#338) (HEAD -> 202205) [vdahiya12]

linux-kernel:
* 4e62401 2023-02-09 | Update linux kernel for hw-mgmt V.7.0020.4104 (sonic-net#305) (HEAD -> 202205) [Stephen Sun]

Signed-off-by: Ying Xie <[email protected]>
yxieca added a commit that referenced this pull request Feb 22, 2023
…nux-kernel] advance submodule head (#13906)

linkmgrd:
* 3e7a9df 2023-02-19 | [active-active] Toggle to standby if default route is missing (#171) (HEAD -> 202205) [Longxiang Lyu]
* 8ab1b2b 2023-02-15 | [active-active] fix issue that interfaces get stuck in `active` if service starts up with link state down (#169) [Jing Zhang]
* df862ad 2023-02-11 | Fix mux config when gRPC connection is lost (#166) [Longxiang Lyu]

utilities:
* 8aa7930c 2023-02-13 | [portstat CLI] don't print reminder if use json format (#2670) (HEAD -> 202205, github/202205) [wenyiz2021]
* 4e3bb6fa 2023-02-21 | Add "show fabric reachability" command. (#2672) [jfeng-arista]
* 3587a94b 2023-02-18 | [202205][dhcp_relay] Remove add field of vlanid to DHCP_RELAY table while adding vlan (#2680) [Yaqiang Zhu]
* 4f07f7f0 2023-02-10 | Skip saidump for Spine Router as this can take more than 5 sec (#2637) (#2671) [kenneth-arista]
* e61c5ec4 2023-02-10 | [vlan] Refresh dhcpv6_relay config while adding/deleting a vlan (#2660) (#2669) [Yaqiang Zhu]

swss:
* 1bbf725 2023-02-14 | [Workaround] EvpnRemoteVnip2pOrch warmboot check failure (#2626) (HEAD -> 202205) [jcaiMR]
* 380f72b 2023-02-20 | Support for tc-dot1p and tc-dscp qosmap (#2559) [Divya Mukundan]
* dbf6fcc 2022-11-01 | Added LAG member check on addLagMember() (#2464) [Andriy Kokhan]

swss-common:
* b31391b 2023-02-21 | Prevent sonic-db-cli generate core dump (#749) (HEAD -> 202205) [Hua Liu]
* 16ff689 2022-12-13 | Support for TC-DOT1p qos map (#721) [Divya Mukundan]

platform-daemons:
* fb92af4 2023-02-09 | [ycabled] add more coverage to ycabled; add minor name change for vendor API CLI return key-values pairs (#338) (HEAD -> 202205) [vdahiya12]

linux-kernel:
* 4e62401 2023-02-09 | Update linux kernel for hw-mgmt V.7.0020.4104 (#305) (HEAD -> 202205) [Stephen Sun]

Signed-off-by: Ying Xie <[email protected]>
AntonHryshchuk added a commit to AntonHryshchuk/sonic-buildimage that referenced this pull request Feb 22, 2023
Update sonic-swss submodule pointer to include the following:
* f66abed Support for tc-dot1p and tc-dscp qosmap ([sonic-net#2559](sonic-net/sonic-swss#2559))
* 35385ad [RouteOrch] Record ROUTE_TABLE entry programming status to APPL_STATE_DB ([sonic-net#2512](sonic-net/sonic-swss#2512))
* 0704f78 [Workaround] EvpnRemoteVnip2pOrch warmboot check failure ([sonic-net#2626](sonic-net/sonic-swss#2626))
* 4df5cab [ResponsePublisher] add pipeline support  ([sonic-net#2511](sonic-net/sonic-swss#2511))

Signed-off-by: AntonHryshchuk <[email protected]>
dprital added a commit to dprital/sonic-buildimage that referenced this pull request Feb 23, 2023
Update sonic-swss submodule pointer to include the following:
* baa302e Do not allow to add port to .1Q bridge while router port deletion is not completed  ([sonic-net#2669](sonic-net/sonic-swss#2669))
* f66abed Support for tc-dot1p and tc-dscp qosmap ([sonic-net#2559](sonic-net/sonic-swss#2559))
* 35385ad [RouteOrch] Record ROUTE_TABLE entry programming status to APPL_STATE_DB ([sonic-net#2512](sonic-net/sonic-swss#2512))
* 0704f78 [Workaround] EvpnRemoteVnip2pOrch warmboot check failure ([sonic-net#2626](sonic-net/sonic-swss#2626))
* 4df5cab [ResponsePublisher] add pipeline support  ([sonic-net#2511](sonic-net/sonic-swss#2511))

Signed-off-by: dprital <[email protected]>
prsunny pushed a commit that referenced this pull request Feb 23, 2023
Update sonic-swss submodule pointer to include the following:
* baa302e Do not allow to add port to .1Q bridge while router port deletion is not completed  ([#2669](sonic-net/sonic-swss#2669))
* f66abed Support for tc-dot1p and tc-dscp qosmap ([#2559](sonic-net/sonic-swss#2559))
* 35385ad [RouteOrch] Record ROUTE_TABLE entry programming status to APPL_STATE_DB ([#2512](sonic-net/sonic-swss#2512))
* 0704f78 [Workaround] EvpnRemoteVnip2pOrch warmboot check failure ([#2626](sonic-net/sonic-swss#2626))
* 4df5cab [ResponsePublisher] add pipeline support  ([#2511](sonic-net/sonic-swss#2511))
StormLiangMS added a commit that referenced this pull request Mar 8, 2023
Why I did it
submodule advance

b085b5f - [ci] Fix pipeline error about team5 not found. (Core dump in orchagent when assigning router interface to a vlan with untagged mode  #2684) (3 hours ago) [Liu Shilong]
4549b4c - Fix issue: there is no retry while creating a RIF which is in removing state ([201811 sub-module] advance sub-modules: utilities, swss, swss-common #2679) (3 hours ago) [Junchao-Mellanox]
980a45b - [FDB]Fixing FDB consolidated flush for Remote MACs (pmon to stretch #2673) (3 hours ago) [Sudharsan Dhamal Gopalarathnam]
c646607 - Do not allow to add port to .1Q bridge while router port deletion is not completed (Update SDK, FW and SAI #2669) (3 hours ago) [Lior Avramov]
4a321f0 - [orchagent]: Get bridge port ID from orchagent cache instead of SAI API ([201811 sub module] advance sairedis sub module #2657) (3 hours ago) [Lawrence Lee]
f4b88f3 - [Dual-ToR] handle 'mux_tunnel_egress_acl' attrib in order to change ACL configuration (drop on ingress/egress) on standby ToR (lm75 doesn't support written alarm to syslog. #2646) (3 hours ago) [Andriy Yurkiv]
a4f29c1 - [Workaround] EvpnRemoteVnip2pOrch warmboot check failure ([teamd]: wait for swss db flush done before starting teamd container #2626) (3 hours ago) [jcaiMR]
53ee0a8 - Support for tc-dot1p and tc-dscp qosmap ([201803] [router-advertiser] Add templated script to wait for pertinent interfaces to be ready before starting radvd #2559) (3 hours ago) [Divya Mukundan]
b953866 - [dual-tor] add missing SAI attribte in order to create IPNIP tunnel (Config reload/load_minigraph not clearing State DB #2503) (3 hours ago) [Andriy Yurkiv]
How I did it
How to verify it
StormLiangMS pushed a commit to StormLiangMS/sonic-buildimage that referenced this pull request Mar 28, 2023
Related work items: sonic-net#276, sonic-net#305, sonic-net#332, sonic-net#338, sonic-net#339, sonic-net#1188, sonic-net#1192, sonic-net#1197, sonic-net#1206, sonic-net#1685, sonic-net#1690, sonic-net#1696, sonic-net#1699, sonic-net#1709, sonic-net#1727, sonic-net#1737, sonic-net#1741, sonic-net#1742, sonic-net#2511, sonic-net#2512, sonic-net#2532, sonic-net#2559, sonic-net#2626, sonic-net#2638, sonic-net#2645, sonic-net#2649, sonic-net#2660, sonic-net#2669, sonic-net#2670, sonic-net#2678, sonic-net#10084, sonic-net#11442, sonic-net#11873, sonic-net#12047, sonic-net#12110, sonic-net#12207, sonic-net#12529, sonic-net#12678, sonic-net#13235, sonic-net#13287, sonic-net#13372, sonic-net#13395, sonic-net#13456, sonic-net#13497, sonic-net#13522, sonic-net#13545, sonic-net#13547, sonic-net#13552, sonic-net#13569, sonic-net#13572, sonic-net#13578, sonic-net#13591, sonic-net#13611, sonic-net#13647, sonic-net#13649, sonic-net#13660, sonic-net#13710, sonic-net#13716, sonic-net#13724, sonic-net#13726, sonic-net#13732, sonic-net#13735, sonic-net#13739, sonic-net#13757, sonic-net#13786, sonic-net#13792, sonic-net#13800, sonic-net#13801, sonic-net#13802, sonic-net#13805, sonic-net#13806, sonic-net#13812, sonic-net#13814, sonic-net#13822, sonic-net#13831, sonic-net#13834, sonic-net#13847, sonic-net#13870, sonic-net#13882, sonic-net#13884, sonic-net#13885, sonic-net#13894, sonic-net#13895, sonic-net#13926, sonic-net#13932, sonic-net#13935, sonic-net#13942, sonic-net#13951, sonic-net#13953, sonic-net#13964
mihirpat1 pushed a commit to mihirpat1/sonic-buildimage that referenced this pull request Jun 14, 2023
…ic-net#2756)" (sonic-net#2773)

This reverts commit 750e064.
Reverts the PR sonic-net#2756

The fix added breaks the previously added workaround sonic-net#2626. Hence requesting to revert the fix.
Once we find a proper solution for sonic-net#12361 we need to reintegrate this PR
Successfully merging this pull request may close these issues.

Cold-boot: teammgrd PORTCHANNEL configuration flushed by swss docker start script swss.sh
3 participants