Improve feature mode switch process #12188

lixiaoyuner · 2022-09-27T02:35:36Z

Why I did it

Based on our new design about k8s project, need to refine container upgrade process, this PR fix several existing bugs to ensure container can be taken over by kube and can be upgraded successfully and can fallback, if something wrong happens, container can go back to local mode and run the latest stable version image.

How I did it

Refine ctrmgrd service, feature service and start service inside feature container

Fix reading node labels with header bug, need to add --without-header
Remove higher version checker to support fallback
Tag k8s scheduled stable version to latest, ensure when container go back to local mode, it would run the latest stable version
Remove ipv6 parameter when join, this parameter is not necessary

How to verify it

Enable k8s feature is SONiC device, after SONiC device joins k8s cluster, container deployed by k8s can replace SONiC local container and upgrade successfully, if something wrong happens, container can go back to local mode and run the latest stable version image.

Which release branch to backport (provide reason below if selected)

Fix existing bugs

Description for the changelog

Refine ctrmgrd service, feature service and start service inside feature container

Ensure to add label/tag for the feature raised. example - PR#2174 where, Generic Config and Update feature has been labelled as GCU.

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

Signed-off-by: Yun Li <[email protected]>

src/sonic-ctrmgrd/ctrmgr/ctrmgrd.py

renukamanavalan · 2022-10-17T16:59:38Z

src/sonic-ctrmgrd/ctrmgr/ctrmgrd.py

+def tag_latest_image(server, feat, docker_id, image_ver):
+    res = 1
+    if not UNIT_TESTING:
+        status = os.system("docker ps |grep {} >/dev/null".format(docker_id))


Make sure the running time of the container >= expected min time.
By the time, the timer fires, potentially it could have restarted.

Said that, systemd would have failed for frequent restarts.

Think through for any possible hole.

Currently, I set "tag_latest_image_on_wait_seconds" = 600s, it means when a k8s pod running for over 10 minutes(by monitor the pod's container id), I will tag the container's image. About this default value, do you have any ideas?

May be not. Suggest

Try restarting service during this period

Kill the main process in the container that will make the container exit within this period

Any more .... ;)

The situation you say should be:

k8s schedules container and remote_state is running

ctrmgrd capture remote_state ready -> running during last process, register timer_handler(600s, tag_function(previous_docker_id))

during the 600s, systemctl restart feature

The current reactions are:

feature systemd service will stop -> start -> wait, when stop, set remote_state=stopped, when start, not docker start, when wait, wait docker_id

k8s finds its scheduled container is killed, try to reschedule, container_up set remote_state=pending

ctrmgrd captures remote_state=pending, set remote_state=ready

container_up set remote_state=running and set docker_id

ctrmgrd capture remote_state ready -> running, will register timer_handler(600s, tag_function(current_docker_id))

and tag_latest_image_on_wait_seconds is up, tag function finds the previous_docker_id is disappeared, will do nothing

and tag_latest_image_on_wait_seconds is up, tag function finds the current_docker_id, tag latest.

Have tested this case and some other cases, all good.
Any other corner cases you want me to test?

And by the way, about removing 'tag as local' from reboot script, you want me to do it in the PR or in the future? I want to do in the future, because it seems redundant, but no real impacts. And it can ensure tag as local when reboot, maybe necessary for us.

I prefer we do it in this PR, as this PR's major purpose is to tag local.

One more Q: After you tag as "local" when you do service restart, who wins, systemd or kube.

I prefer systemd takes over. This would add the new version in label that should auto-block kube from deploying.
Can you check this out too?

Tag as local is in reboot script in sonic-utilities repo, removing from reboot script, PR is here. #Remove tag as local

Q: After you tag as "local" when do service restart, who wins?
Kube wins.
I see label such as feature_version_enabled=false will be set only when containers start up as local. Actually, when we restart the service, systemd stop will set remote_state=stopped so that systemd will wait k8s to reschedule. K8s finds its container is down, will reschedule immediately. In this process, container has no chance to start up as local to set feature_version_enabled=false. So, there is no auto-block.
About this problem, I think it's right that service is taken over by kube in this situation, because the set-owner is kube and k8s is connected. Taken over by kube should be the proper action.

By the way. I want to remove the instance_higher checker, because we want to support fallback in the future, if post-check failed, we could fall back to the previous version, I think the version should be managed by scheduler. What's your concern about this before, is there any necessary things so that we always need the higher version? #remove in this commit

Two points:

Please merge the reboot fix PR into this PR -- Like to keep all related in one PR.
For example, if we revert one, it should affect both. it should both either stay or go as one piece.

I agree with your explanation. Earlier tagging upon reboot only did not need any change, as state-db starts empty. Now that we are preponing the tagging, we need to do required updates that systemd takes over, if kube image is tagged as latest. In fact, if it makes simple, we can change from "kube running" to "kube-transient" once you tag it latest. On next systemd restart, seeing it kube-transient" we can go local path.

No need for kube to manage after local tagging. It is more stable if systemd takes over.

Point 1: About the two PRs, the reason is that the reboot script is in another repo called sonic-utilities, seems we can't raise a PR including two repos code.

Point 2: About after tagging, it's easy to let feature go back to local after restart, but one of our visions is that k8s can collect the containers real-time metrics, if features are taken over by systemd after tagging, we can't keep collecting the metrics.

Point 1 -- Agree.
Point 2 -- Debatable. What happens if device gets rebooted ? We have STATE-DB for counters/metrics.

Let it be so for now.

renukamanavalan

Looks good to me.
Just take care of "any possible loop hole" before tagging as local.
If you are doing this, we could remove the "tagging as local" from the reboot script.

lixiaoyuner · 2022-10-18T09:59:49Z

Looks good to me. Just take care of "any possible loop hole" before tagging as local. If you are doing this, we could remove the "tagging as local" from the reboot script.

Thanks @renukamanavalan, about possible loop hole, let me try to test this part of code carefully. And about removing "tagging as local" in reboot script, I don't think there will be some conflicts with PR, maybe remove it int the future if needed?

renukamanavalan · 2022-10-18T15:00:20Z

Looks good to me. Just take care of "any possible loop hole" before tagging as local. If you are doing this, we could remove the "tagging as local" from the reboot script.

Thanks @renukamanavalan, about possible loop hole, let me try to test this part of code carefully. And about removing "tagging as local" in reboot script, I don't think there will be some conflicts with PR, maybe remove it int the future if needed?

We will do it as one set. It is most likely to slip through the cracks otherwise. We just want one way of tagging local.
Just one more: Make sure, after tagged as local, the next service restart will let systemd take over and we don't need kube to manage this anymore.

qiluo-msft · 2022-11-02T20:26:14Z

src/sonic-ctrmgrd/ctrmgr/container_startup.py

-            do_freeze(feature, "bail out as current deploy version {} is not higher".
-                    format(version))
-            return
+        # if not instance_higher(feature, state_data[VERSION], version):


if not instance_higher(feature, state_data[VERSION], version):

Do not just comment code, remove them if they are dead. #Closed

qiluo-msft · 2022-11-02T20:26:38Z

src/sonic-ctrmgrd/ctrmgr/ctrmgrd.py

    USE_K8S_PROXY: ""
    }

 def log_debug(m):
    msg = "{}: {}".format(inspect.stack()[1][3], m)
-    print(msg)
+    #print(msg)


#print(msg)

Remove the debug code. #Closed

qiluo-msft · 2022-11-02T20:29:38Z

src/sonic-ctrmgrd/ctrmgr/ctrmgrd.py

@@ -172,11 +176,11 @@ def register_db(self, db_name):
            self.db_connectors[db_name] = swsscommon.DBConnector(db_name, 0)


-    def register_timer(self, ts, handler):
+    def register_timer(self, ts, handler, args=()):


args=()

Even it is allowed to use empty tuple (which is immutable) as default value, high suggest using None as default value. #Closed

qiluo-msft · 2022-11-02T20:30:31Z

src/sonic-ctrmgrd/ctrmgr/kube_commands.py

    err = ""
    out = ""
    ret = 0
    try:
-        local_ipv6 = _get_local_ipv6()
+        #local_ipv6 = _get_local_ipv6()
        #_download_file(server, port, insecure)


Delete dead code. #Closed

* Fix kube mode to local mode long duration issue * Remove IPV6 parameters which is not necessary * Fix read node labels bug * Tag the running image to latest if it's stable * Disable image_version_higher check * Change image_version_higher checker test case Signed-off-by: Yun Li <[email protected]>

mssonicbld · 2022-12-01T15:31:12Z

@lixiaoyuner PR conflicts with 202205 branch

* Fix kube mode to local mode long duration issue * Remove IPV6 parameters which is not necessary * Fix read node labels bug * Tag the running image to latest if it's stable * Disable image_version_higher check * Change image_version_higher checker test case Signed-off-by: Yun Li <[email protected]>

mssonicbld · 2022-12-02T08:13:23Z

Cherry-pick PR to 202205: #12919

* Fix kube mode to local mode long duration issue * Remove IPV6 parameters which is not necessary * Fix read node labels bug * Tag the running image to latest if it's stable * Disable image_version_higher check * Change image_version_higher checker test case Signed-off-by: Yun Li <[email protected]> Signed-off-by: Yun Li <[email protected]> Co-authored-by: lixiaoyuner <[email protected]>

lixiaoyuner added 3 commits September 27, 2022 02:21

Fix kube mode to local mode long duration issue

6ef5dd8

Signed-off-by: Yun Li <[email protected]>

Remove IPV6 parameters which is not necessary

d35d166

Signed-off-by: Yun Li <[email protected]>

Fix read node labels bug

86265d2

lixiaoyuner force-pushed the mode_switch_improvement branch from 479a90a to d4fbd3c Compare October 17, 2022 08:13

lixiaoyuner marked this pull request as ready for review October 17, 2022 08:40

lixiaoyuner requested a review from lguohan as a code owner October 17, 2022 08:40

lixiaoyuner requested a review from renukamanavalan October 17, 2022 09:30

lixiaoyuner added the Request for 202012 Branch label Oct 17, 2022

lixiaoyuner requested a review from qiluo-msft October 17, 2022 09:31

renukamanavalan reviewed Oct 17, 2022

View reviewed changes

src/sonic-ctrmgrd/ctrmgr/ctrmgrd.py Show resolved Hide resolved

renukamanavalan reviewed Oct 17, 2022

View reviewed changes

lixiaoyuner requested a review from renukamanavalan October 20, 2022 01:48

lixiaoyuner force-pushed the mode_switch_improvement branch from d4fbd3c to e3e9a10 Compare October 21, 2022 03:08

Tag the running image to latest if it's stable

7509d11

lixiaoyuner force-pushed the mode_switch_improvement branch from e3e9a10 to 7509d11 Compare October 21, 2022 06:13

lixiaoyuner added 2 commits October 24, 2022 09:59

Disable image_version_higher check

57bc905

Change image_version_higher checker test case

e059786

renukamanavalan approved these changes Oct 26, 2022

View reviewed changes

lixiaoyuner merged commit e1440f0 into sonic-net:master Nov 2, 2022

qiluo-msft reviewed Nov 2, 2022

View reviewed changes

lixiaoyuner mentioned this pull request Nov 3, 2022

Fix code irregular issues #12595

Merged

7 tasks

qiluo-msft added the Included in 202012 Branch label Nov 18, 2022

lixiaoyuner added the Request for 202205 Branch label Nov 28, 2022

lixiaoyuner mentioned this pull request Dec 1, 2022

Disable "tag as local" when reboot sonic-net/sonic-utilities#2451

Merged

yxieca added the Approved for 202205 Branch label Dec 1, 2022

mssonicbld added the Cherry Pick Conflict_202205 label Dec 1, 2022

lixiaoyuner added Approved for 202205 Branch and removed Approved for 202205 Branch labels Dec 2, 2022

mssonicbld added the Created PR to 202205 Branch label Dec 2, 2022

mssonicbld mentioned this pull request Dec 2, 2022

[action] [PR:12188] Improve feature mode switch process #12919

Merged

lixiaoyuner removed the Cherry Pick Conflict_202205 label Dec 4, 2022

mssonicbld added Included in 202205 Branch and removed Request for 202205 Branch Approved for 202205 Branch Created PR to 202205 Branch labels Dec 4, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve feature mode switch process #12188

Improve feature mode switch process #12188

lixiaoyuner commented Sep 27, 2022 •

edited

Loading

renukamanavalan Oct 17, 2022

lixiaoyuner Oct 18, 2022

renukamanavalan Oct 18, 2022

lixiaoyuner Oct 19, 2022 •

edited

Loading

renukamanavalan Oct 21, 2022

lixiaoyuner Oct 24, 2022 •

edited

Loading

renukamanavalan Oct 25, 2022

lixiaoyuner Oct 26, 2022

renukamanavalan Oct 26, 2022

renukamanavalan left a comment

lixiaoyuner commented Oct 18, 2022

renukamanavalan commented Oct 18, 2022

qiluo-msft Nov 2, 2022 •

edited

Loading

qiluo-msft Nov 2, 2022 •

edited

Loading

qiluo-msft Nov 2, 2022 •

edited

Loading

qiluo-msft Nov 2, 2022 •

edited

Loading

mssonicbld commented Dec 1, 2022

mssonicbld commented Dec 2, 2022

Improve feature mode switch process #12188

Improve feature mode switch process #12188

Conversation

lixiaoyuner commented Sep 27, 2022 • edited Loading

Why I did it

How I did it

How to verify it

Which release branch to backport (provide reason below if selected)

Description for the changelog

Ensure to add label/tag for the feature raised. example - PR#2174 where, Generic Config and Update feature has been labelled as GCU.

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

renukamanavalan Oct 17, 2022

Choose a reason for hiding this comment

lixiaoyuner Oct 18, 2022

Choose a reason for hiding this comment

renukamanavalan Oct 18, 2022

Choose a reason for hiding this comment

lixiaoyuner Oct 19, 2022 • edited Loading

Choose a reason for hiding this comment

renukamanavalan Oct 21, 2022

Choose a reason for hiding this comment

lixiaoyuner Oct 24, 2022 • edited Loading

Choose a reason for hiding this comment

renukamanavalan Oct 25, 2022

Choose a reason for hiding this comment

lixiaoyuner Oct 26, 2022

Choose a reason for hiding this comment

renukamanavalan Oct 26, 2022

Choose a reason for hiding this comment

renukamanavalan left a comment

Choose a reason for hiding this comment

lixiaoyuner commented Oct 18, 2022

renukamanavalan commented Oct 18, 2022

qiluo-msft Nov 2, 2022 • edited Loading

Choose a reason for hiding this comment

if not instance_higher(feature, state_data[VERSION], version):

qiluo-msft Nov 2, 2022 • edited Loading

Choose a reason for hiding this comment

qiluo-msft Nov 2, 2022 • edited Loading

Choose a reason for hiding this comment

qiluo-msft Nov 2, 2022 • edited Loading

Choose a reason for hiding this comment

mssonicbld commented Dec 1, 2022

mssonicbld commented Dec 2, 2022

lixiaoyuner commented Sep 27, 2022 •

edited

Loading

lixiaoyuner Oct 19, 2022 •

edited

Loading

lixiaoyuner Oct 24, 2022 •

edited

Loading

qiluo-msft Nov 2, 2022 •

edited

Loading

qiluo-msft Nov 2, 2022 •

edited

Loading

qiluo-msft Nov 2, 2022 •

edited

Loading

qiluo-msft Nov 2, 2022 •

edited

Loading