[telemetry] | After upgrade from 202305 to 202311 telemetry still in config_db and makes system not ready #19081

Closed
dprital opened this issue May 26, 2024 · 4 comments · Fixed by #19153
Labels: Issue for 202311, Triaged

dprital (Collaborator) commented May 26, 2024

Description

After upgrading from 202305 to 202311, the "telemetry" feature remains "enabled" in config_db even though the "telemetry" docker no longer exists (the gnmi docker replaced it).
As a result, the system is declared "not ready":

admin@r-panther-13:~$ show system-health sysready-status 
System is not ready - one or more services are not up

Steps to reproduce the issue:

  1. Install the 202305 SONiC image from ONIE (SONiC.202305.555792-561bb5420).
  2. Make sure the "telemetry" feature is enabled, then run "config save -y" to save it to config_db.json.
  3. Install the 202311 image (SONiC.202311.555802-ef65c9653) and reboot.
  4. After the system is up, run "sudo show system-health detail" and see that the telemetry service appears as "Not OK".

Describe the results you received:

admin@r-panther-13:~$ show feature status
Feature         State            AutoRestart     SetOwner
--------------  ---------------  --------------  ----------
bgp             enabled          enabled
database        always_enabled   always_enabled
dhcp_relay      disabled         enabled         local
eventd          enabled          enabled
gnmi            enabled          enabled
lldp            enabled          enabled
macsec          disabled         enabled         local
mgmt-framework  enabled          enabled
mux             always_disabled  enabled
nat             disabled         enabled
pmon            enabled          enabled
radv            enabled          enabled
sflow           disabled         enabled
snmp            enabled          enabled
swss            enabled          enabled
syncd           enabled          enabled
teamd           enabled          enabled
telemetry       enabled          enabled

admin@r-panther-13:~$ redis-cli -n 4 hgetall "FEATURE|telemetry"
 1) "auto_restart"
 2) "enabled"
 3) "delayed"
 4) "True"
 5) "has_global_scope"
 6) "True"
 7) "has_per_asic_scope"
 8) "False"
 9) "high_mem_alert"
10) "disabled"
11) "state"
12) "enabled"
13) "support_syslog_rate_limit"
14) "true"

admin@r-panther-13:~$ show system-health sysready-status 
System is not ready - one or more services are not up

Service-Name            Service-Status    App-Ready-Status    Down-Reason
----------------------  ----------------  ------------------  -------------
auditd                  OK                OK                  -
bgp                     OK                OK                  -
caclmgrd                OK                OK                  -
config-chassisdb        OK                OK                  -
config-setup            OK                OK                  -
containerd              OK                OK                  -
cron                    OK                OK                  -
database                OK                OK                  -
determine-reboot-cause  Starting          Starting            -
docker                  OK                OK                  -
eventd                  OK                OK                  -
gnmi                    OK                OK                  -
hw-management           OK                OK                  -
hw-management-tc        OK                OK                  -
kdump-tools             OK                OK                  -
lldp                    OK                OK                  -
mgmt-framework          OK                OK                  -
netfilter-persistent    OK                OK                  -
ntp                     OK                OK                  -
nv-syncd-shared         OK                OK                  -
pmon                    OK                OK                  -
procdockerstatsd        OK                OK                  -
radv                    OK                OK                  -
ras-mc-ctl              OK                OK                  -
rsyslog                 OK                OK                  -
smartmontools           OK                OK                  -
snmp                    OK                OK                  -
ssh                     OK                OK                  -
swss                    OK                OK                  -
syncd                   OK                OK                  -
sysstat                 OK                OK                  -
teamd                   OK                OK                  -

admin@r-panther-13:~$ sudo show system-health detail 
System status summary

  System status LED  red
  Services:
    Status: Not OK
    Not Running: telemetry
  Hardware:
    Status: OK

System services and devices monitor list

Name                   Status    Type
---------------------  --------  ----------
telemetry              Not OK    Service
sonic                  OK        System
rsyslog                OK        Process
root-overlay           OK        Filesystem
var-log                OK        Filesystem
routeCheck             OK        Program
dualtorNeighborCheck   OK        Program
diskCheck              OK        Program
container_checker      OK        Program
vnetRouteCheck         OK        Program
memory_check           OK        Program
container_memory_snmp  OK        Program
container_memory_gnmi  OK        Program
container_eventd       OK        Program
eventd:eventd          OK        Process
database:redis         OK        Process
syncd:syncd            OK        Process
bgp:zebra              OK        Process
bgp:staticd            OK        Process
bgp:bgpd               OK        Process
bgp:fpmsyncd           OK        Process
bgp:bgpcfgd            OK        Process
teamd:teammgrd         OK        Process
teamd:teamsyncd        OK        Process
teamd:tlm_teamd        OK        Process
snmp:snmpd             OK        Process
snmp:snmp-subagent     OK        Process
lldp:lldpd             OK        Process
lldp:lldp-syncd        OK        Process
lldp:lldpmgrd          OK        Process
gnmi:gnmi-native       OK        Process
ASIC                   OK        ASIC
fan1                   OK        Fan
fan2                   OK        Fan
fan3                   OK        Fan
fan4                   OK        Fan
fan5                   OK        Fan
fan6                   OK        Fan
fan7                   OK        Fan
fan8                   OK        Fan
psu1_fan1              OK        Fan
psu2_fan1              OK        Fan
PSU 1                  OK        PSU
PSU 2                  OK        PSU

System services and devices ignore list

Name         Status    Type
-----------  --------  ------
psu.voltage  Ignored   Device

Describe the results you expected:

The telemetry feature is not enabled and does not exist in config_db, and the system health status is ready:

admin@r-panther-13:~$ show system-health sysready-status 
System is ready

Output of show version:

Before upgrade:

SONiC Software Version: SONiC.202305.555792-561bb5420
SONiC OS Version: 11
Distribution: Debian 11.8
Kernel: 5.10.0-23-2-amd64
Build commit: 561bb5420
Build date: Sat May 25 12:47:17 UTC 2024
Built by: AzDevOps@vmss-soni003RDR

Platform: x86_64-mlnx_msn2700-r0
HwSKU: Mellanox-SN2700
ASIC: mellanox
ASIC Count: 1
Serial Number: MT2020T04244
Model Number: MSN2700-CS2FO
Hardware Revision: A2
Uptime: 09:28:44 up 36 min,  1 user,  load average: 1.31, 1.41, 1.28
Date: Sun 26 May 2024 09:28:44

Docker images:
REPOSITORY                    TAG                       IMAGE ID       SIZE
docker-syncd-mlnx             202305.555792-561bb5420   59fcb35e9d7b   837MB
docker-syncd-mlnx             latest                    59fcb35e9d7b   837MB
docker-orchagent              202305.555792-561bb5420   1f32ef57e739   330MB
docker-orchagent              latest                    1f32ef57e739   330MB
docker-fpm-frr                202305.555792-561bb5420   0acb92b5e34d   349MB
docker-fpm-frr                latest                    0acb92b5e34d   349MB
docker-nat                    202305.555792-561bb5420   3eac1efd75ae   321MB
docker-nat                    latest                    3eac1efd75ae   321MB
docker-sflow                  202305.555792-561bb5420   753719535f5d   319MB
docker-sflow                  latest                    753719535f5d   319MB
docker-teamd                  202305.555792-561bb5420   227f42007083   318MB
docker-teamd                  latest                    227f42007083   318MB
docker-macsec                 latest                    2cbb673a4674   320MB
docker-platform-monitor       202305.555792-561bb5420   93874a05300f   829MB
docker-platform-monitor       latest                    93874a05300f   829MB
docker-dhcp-relay             latest                    4183bd1aa6b8   308MB
docker-eventd                 202305.555792-561bb5420   c288be58206d   300MB
docker-eventd                 latest                    c288be58206d   300MB
docker-sonic-telemetry        202305.555792-561bb5420   114637010d93   387MB
docker-sonic-telemetry        latest                    114637010d93   387MB
docker-snmp                   202305.555792-561bb5420   afae0dfe9a30   339MB
docker-snmp                   latest                    afae0dfe9a30   339MB
docker-lldp                   202305.555792-561bb5420   dfb141c58476   343MB
docker-lldp                   latest                    dfb141c58476   343MB
docker-mux                    202305.555792-561bb5420   a938528c4f23   349MB
docker-mux                    latest                    a938528c4f23   349MB
docker-database               202305.555792-561bb5420   35bc1fc93417   300MB
docker-database               latest                    35bc1fc93417   300MB
docker-router-advertiser      202305.555792-561bb5420   235b9e87e32a   300MB
docker-router-advertiser      latest                    235b9e87e32a   300MB
docker-sonic-mgmt-framework   202305.555792-561bb5420   048c64c46cb0   414MB
docker-sonic-mgmt-framework   latest                    048c64c46cb0   414MB

After upgrade:

SONiC Software Version: SONiC.202311.555802-ef65c9653
SONiC OS Version: 11
Distribution: Debian 11.9
Kernel: 5.10.0-23-2-amd64
Build commit: ef65c9653
Build date: Sat May 25 11:33:25 UTC 2024
Built by: AzDevOps@vmss-soni003RDY

Platform: x86_64-mlnx_msn2700-r0
HwSKU: Mellanox-SN2700
ASIC: mellanox
ASIC Count: 1
Serial Number: MT2020T04244
Model Number: MSN2700-CS2FO
Hardware Revision: A2
Uptime: 10:00:46 up 12 min,  1 user,  load average: 1.24, 1.53, 1.34
Date: Sun 26 May 2024 10:00:46

Docker images:
REPOSITORY                    TAG                       IMAGE ID       SIZE
docker-syncd-mlnx             202311.555802-ef65c9653   7c30f0f1cd3e   887MB
docker-syncd-mlnx             latest                    7c30f0f1cd3e   887MB
docker-platform-monitor       202311.555802-ef65c9653   802ef37e04b1   876MB
docker-platform-monitor       latest                    802ef37e04b1   876MB
docker-dhcp-relay             latest                    bc4d96bd50be   323MB
docker-macsec                 latest                    3306c4985870   342MB
docker-eventd                 202311.555802-ef65c9653   563d236abf2c   313MB
docker-eventd                 latest                    563d236abf2c   313MB
docker-orchagent              202311.555802-ef65c9653   e951ca26fe01   351MB
docker-orchagent              latest                    e951ca26fe01   351MB
docker-fpm-frr                202311.555802-ef65c9653   147dab40f2ab   371MB
docker-fpm-frr                latest                    147dab40f2ab   371MB
docker-nat                    202311.555802-ef65c9653   8b40899827bd   343MB
docker-nat                    latest                    8b40899827bd   343MB
docker-sflow                  202311.555802-ef65c9653   6d82a5dd5478   341MB
docker-sflow                  latest                    6d82a5dd5478   341MB
docker-teamd                  202311.555802-ef65c9653   3ae29818c2bb   340MB
docker-teamd                  latest                    3ae29818c2bb   340MB
docker-snmp                   202311.555802-ef65c9653   02c8217ce70c   352MB
docker-snmp                   latest                    02c8217ce70c   352MB
docker-router-advertiser      202311.555802-ef65c9653   e2fb88ab4a70   313MB
docker-router-advertiser      latest                    e2fb88ab4a70   313MB
docker-lldp                   202311.555802-ef65c9653   86b6cbf5ed87   356MB
docker-lldp                   latest                    86b6cbf5ed87   356MB
docker-mux                    202311.555802-ef65c9653   779b4d96e25d   362MB
docker-mux                    latest                    779b4d96e25d   362MB
docker-database               202311.555802-ef65c9653   376d4abda7a9   313MB
docker-database               latest                    376d4abda7a9   313MB
docker-sonic-gnmi             202311.555802-ef65c9653   4f7f81c3cc6d   401MB
docker-sonic-gnmi             latest                    4f7f81c3cc6d   401MB
docker-sonic-mgmt-framework   202311.555802-ef65c9653   e40cd25fa749   430MB
docker-sonic-mgmt-framework   latest                    e40cd25fa749   430MB

Output of show techsupport:

(paste your output here or download and attach the file here )

Additional information you deem important (e.g. issue happens only occasionally):

dprital (Collaborator, Author) commented May 26, 2024

@ganglyu, it looks to me like the move from "telemetry" to "gnmi" is not handled by db_migrator. Can you please check?

ganglyu (Contributor) commented May 28, 2024

Thanks, I will check

arlakshm added the Triaged label Jun 5, 2024
StormLiangMS pushed a commit that referenced this issue Jun 18, 2024
Why I did it
Fix #19081
We replaced the telemetry container with the gnmi container, but the telemetry feature is still enabled after upgrade.
The service_checker script reads the feature table and checks whether each container is running; telemetry is enabled, but there is no telemetry container.
It is difficult to disable telemetry in the feature table for both warm reboot and cold reboot: we would need to check the docker image in db migrator and minigraph.py.
When we use warm reboot to upgrade from 202305 to 202311, config_db still has the telemetry configuration, and we can't simply remove the related configuration.
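
The failure mode described here can be illustrated with a simplified model. This is only a sketch, not the actual service_checker code: the function name `find_not_running` and the plain dict/list inputs are assumptions made for the example, and the sample data is a subset of the `show feature status` output in this issue.

```python
from typing import Dict, Iterable, List


def find_not_running(feature_states: Dict[str, str],
                     running_containers: Iterable[str]) -> List[str]:
    """Return enabled features that have no running container.

    Hypothetical helper: models "read the feature table, then check
    that each enabled feature's container is running".
    """
    expected = {
        name for name, state in feature_states.items()
        if state in ("enabled", "always_enabled")
    }
    return sorted(expected - set(running_containers))


# Subset of the feature table after the upgrade to 202311.
feature_states = {
    "bgp": "enabled",
    "database": "always_enabled",
    "gnmi": "enabled",
    "nat": "disabled",
    "telemetry": "enabled",  # stale entry carried over from 202305
}

# The 202311 image ships no telemetry container, so it can never run.
running = ["bgp", "database", "gnmi"]

print(find_not_running(feature_states, running))  # ['telemetry']
```

Under this model, any feature left enabled in config_db without a matching container is reported as not running, which is what marks the system as not ready.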

Work item tracking
Microsoft ADO (number only):
How I did it
I modified the service_checker script:
If the docker-sonic-telemetry image exists, check the telemetry container.
If there is no docker-sonic-telemetry image, check the gnmi container instead.
If there is neither a docker-sonic-telemetry image nor a docker-sonic-gnmi image, do not check telemetry.
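
The three rules above can be sketched as a single selection function. This is a minimal illustration under stated assumptions, not the code merged in the fix: the function name `container_to_check` and the idea of passing in a list of installed image names are hypothetical; only the image names `docker-sonic-telemetry` and `docker-sonic-gnmi` come from this issue.

```python
from typing import Iterable, Optional

TELEMETRY_IMAGE = "docker-sonic-telemetry"
GNMI_IMAGE = "docker-sonic-gnmi"


def container_to_check(installed_images: Iterable[str]) -> Optional[str]:
    """Pick the container to health-check for an enabled telemetry feature.

    Hypothetical helper: returns "telemetry", "gnmi", or None when
    neither image is installed and the check should be skipped.
    """
    images = set(installed_images)
    if TELEMETRY_IMAGE in images:
        return "telemetry"
    if GNMI_IMAGE in images:
        return "gnmi"
    return None


# 202305 image set: the telemetry container is still checked.
print(container_to_check(["docker-sonic-telemetry", "docker-lldp"]))  # telemetry

# 202311 image set: gnmi is checked instead.
print(container_to_check(["docker-sonic-gnmi", "docker-lldp"]))  # gnmi

# Neither image installed: skip the check.
print(container_to_check(["docker-lldp"]))  # None
```

Keying the decision on installed images rather than on config_db leaves the stale feature entry untouched, which matches the stated goal of not removing configuration during a warm reboot upgrade.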

How to verify it
Run the unit tests and the end-to-end test.
mssonicbld pushed a commit to mssonicbld/sonic-buildimage that referenced this issue Jun 18, 2024
ganglyu reopened this Jun 18, 2024

ganglyu (Contributor) commented Jun 18, 2024

Need to backport to 202311

mssonicbld pushed a commit that referenced this issue Jun 18, 2024
mssonicbld pushed a commit to mssonicbld/sonic-buildimage that referenced this issue Jun 20, 2024
yxieca pushed a commit that referenced this issue Jun 21, 2024

Co-authored-by: ganglv <[email protected]>
ganglyu (Contributor) commented Jun 21, 2024

PR is merged

ganglyu closed this as completed Jul 19, 2024
arun1355492 pushed a commit to arun1355492/sonic-buildimage that referenced this issue Jul 26, 2024