sysready #8889

Merged
merged 1 commit into from
Nov 10, 2021

Conversation

Contributor

@sg893052 sg893052 commented Oct 1, 2021

Why I did it

System Ready Feature Implementation as per https://github.com/Azure/SONiC/pull/875/files
At present, there is no mechanism to tell when the system is up, that is, when all the essential SONiC services are running and all the docker apps, along with the ports, are ready to carry network traffic. Because SONiC's architecture is asynchronous, we cannot verify that the config has been applied all the way down to the HW. We can, however, get the closest up status of each app and derive the system readiness from it.

How I did it

A new Python-based system monitor framework is introduced. It monitors all the essential system host services, including the docker wrapper services, on an event-based model and declares when the system is ready. The framework lets docker apps report their closest up status. CLIs are provided to fetch the current system status, as well as each service's running status and app ready status, along with the failure reason, if any.
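As a rough illustration of the readiness rule described above, the sketch below declares the system ready only when every monitored service is both running and app-ready, and collects a failure reason otherwise. The `ServiceState` type, the `system_ready` function, and the reason strings are hypothetical names chosen for this example; they are not taken from the PR's code.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class ServiceState:
    """Last known state of one monitored service (illustrative type, not from the PR)."""
    running: bool                      # is the service unit active?
    app_ready: bool                    # has the app reported its closest up status?
    fail_reason: Optional[str] = None  # optional detail supplied by the checker


def system_ready(states: Dict[str, ServiceState]) -> Tuple[bool, Dict[str, str]]:
    """Return (ready, failures); ready is True only if no service has a failure."""
    failures = {}
    for name, st in states.items():
        if not st.running:
            failures[name] = st.fail_reason or "service not running"
        elif not st.app_ready:
            failures[name] = st.fail_reason or "app not ready"
    return (not failures, failures)


states = {
    "swss.service": ServiceState(running=True, app_ready=True),
    "bgp.service": ServiceState(running=True, app_ready=False),
}
ready, failures = system_ready(states)
```

In this example `ready` is false because `bgp.service` is running but has not yet reported that its app is ready.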

How to verify it

"show system status core" click CLI
"show system status all" click CLI
Syslogs for system ready

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106

Description for the changelog

A picture of a cute animal (not mandatory but encouraged)

@sg893052 sg893052 requested a review from lguohan as a code owner October 1, 2021 16:40
ghost commented Oct 1, 2021

CLA assistant check
All CLA requirements met.

lgtm-com bot commented Oct 1, 2021

This pull request introduces 12 alerts when merging 3068490182bd487229aa451538852924019e1815 into 3e397ce - view on LGTM.com

new alerts:

  • 4 for Unused import
  • 2 for Unused local variable
  • 2 for Unreachable code
  • 2 for Variable defined multiple times
  • 1 for Unnecessary pass
  • 1 for Module is imported with 'import' and 'import from'

lgtm-com bot commented Oct 4, 2021

This pull request introduces 2 alerts when merging 5c28c6d4c91ce8c4a76ce3bd5b8d6f8fd9a453ba into df6361f - view on LGTM.com

new alerts:

  • 2 for Unreachable code

lgtm-com bot commented Oct 4, 2021

This pull request introduces 1 alert when merging e5bd0e9ee56db3a56abf110651a3c3b81d202606 into df6361f - view on LGTM.com

new alerts:

  • 1 for Unreachable code

lgtm-com bot commented Oct 4, 2021

This pull request introduces 1 alert when merging d820b985ef04b9922eaae828595743cef224e015 into df6361f - view on LGTM.com

new alerts:

  • 1 for Unreachable code

lgtm-com bot commented Oct 11, 2021

This pull request introduces 1 alert when merging 4053ebbe5e6a4d1887723762ddd22675a1cb7640 into 7d40384 - view on LGTM.com

new alerts:

  • 1 for Unreachable code

lgtm-com bot commented Oct 14, 2021

This pull request introduces 1 alert when merging 36f9aad2a24d409f9dfc743b55b039a9c81777b2 into b9366f3 - view on LGTM.com

new alerts:

  • 1 for Variable defined multiple times

@Kalimuthu-Velappan
Contributor

retest this please

@Kalimuthu-Velappan
Contributor

/azpw run

lgtm-com bot commented Oct 28, 2021

This pull request introduces 1 alert when merging 46cbef94a90d3e19214ff13f75ab6d8d33608f40 into 3788294 - view on LGTM.com

new alerts:

  • 1 for Variable defined multiple times

lgtm-com bot commented Oct 29, 2021

This pull request introduces 1 alert when merging fd11a1e4804cb8d49a2f854b6c40ae54c8349ff6 into 51c9c98 - view on LGTM.com

new alerts:

  • 1 for Variable defined multiple times

key "name";

leaf name {
description "host feature name in host Feature table";
Collaborator

Can you give some examples of host features, like hostcfgd, caclmgrd, etc.?

Contributor Author

@sg893052 sg893052 Nov 4, 2021

Included the example in the description now.

"high_mem_alert": "disabled"
"high_mem_alert": "disabled",
{%- if feature in ["bgp", "swss", "pmon", "nat", "teamd", "dhcp_relay", "sflow", "l2mcd", "udld", "stp", "snmp", "lldp", "radv", "iccpd", "syncd", "vrrp", "mgmt-framework", "tam"] %}
"check_up_status" : "false"
Collaborator

Why do we need this initialization? Can't we assume that if the check_up_status field is not present, it's false?

Contributor Author

This is part of the framework's provision for dockers to set the "check_up_status" flag to "true".
For now, check_up_status is set to 'false' for all these dockers. Once a docker implements the logic of marking the up_status flag in STATE_DB, its check_up_status can be set to 'true'.
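The relationship between the two flags can be sketched as follows. This is a simplified reading of the thread, not the PR's implementation: `app_ready` is a hypothetical helper, the entries are plain dicts standing in for the feature config and the STATE_DB entry, and treating a missing `check_up_status` as "false" is an assumption (it is exactly what the reviewer asks about above).

```python
def app_ready(feature_cfg: dict, state_entry: dict) -> bool:
    """Decide app readiness for one feature.

    feature_cfg : the feature's config entry (may carry "check_up_status")
    state_entry : the feature's STATE_DB entry (may carry "up_status")
    """
    # Assumption: a missing flag behaves the same as an explicit "false".
    must_check = feature_cfg.get("check_up_status", "false").lower() == "true"
    if not must_check:
        # The docker does not publish up_status yet, so it is not waited on.
        return True
    # The docker opted in: readiness follows the up_status it marks in STATE_DB.
    return state_entry.get("up_status", "false").lower() == "true"
```

With this rule, a docker whose `check_up_status` is "false" never blocks system readiness on `up_status`, while a docker that opts in is considered ready only after it marks `up_status` as "true".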

sysready_lock.unlock()
except Exception as e:
logger.log_error( str(e))
time.sleep(2)
Collaborator

Do we need to periodically check the status every 2 secs? What's the CPU usage for this service?

Contributor Author

sleep(2) is invoked only in case of an exception.
There is no periodic check.
Because the design is an event-based framework, the logic checks only the status of the service unit that was received in the queue. It will not hog the CPU.
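The pattern described in this reply can be sketched with a blocking queue: the loop sleeps inside `Queue.get()` until an event arrives, and the fixed delay is reached only on the error path. The `monitor_loop` function, its `max_events` parameter (added so the example terminates), and the callback signature are illustrative assumptions, not the PR's actual code.

```python
import queue
import time


def monitor_loop(events, check_unit, max_events=None, retry_delay=2.0):
    """Block on the queue and check only the unit named in each event."""
    handled = 0
    while max_events is None or handled < max_events:
        unit = events.get()       # blocks without spinning until an event arrives
        try:
            check_unit(unit)      # inspect just the unit that changed
        except Exception as exc:  # the delay applies only on the error path
            print(f"check failed for {unit}: {exc}")
            time.sleep(retry_delay)
        handled += 1


checked = []
q = queue.Queue()
q.put("swss.service")
q.put("bgp.service")
monitor_loop(q, checked.append, max_events=2)
```

Because `get()` blocks, an idle system costs no polling cycles; the work done is proportional to the number of service events received.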

Contributor Author

sg893052 commented Nov 5, 2021

system_ready_ut.pdf

System Ready UT log file uploaded.

lgtm-com bot commented Nov 5, 2021

This pull request introduces 1 alert when merging 3135d67 into 0290207 - view on LGTM.com

new alerts:

  • 1 for Variable defined multiple times

@zhangyanzhao
Collaborator

Venkat approved and all build checks passed. Merge it.

@zhangyanzhao zhangyanzhao merged commit d7e5372 into sonic-net:master Nov 10, 2021
lguohan added a commit that referenced this pull request Nov 10, 2021
This reverts commit d7e5372.
Collaborator

lguohan commented Nov 10, 2021

To note, this PR has been reverted. Per our discussion, I think this PR needs the Mellanox folks to review and approve it.

@liat-grozovik
Collaborator

@zhangyanzhao as we have comments on the HLD itself, I suggest reverting this PR as well.
It should not be merged without agreement on the HLD.

@zhangyanzhao zhangyanzhao added the YANG (YANG model related changes) label Nov 11, 2021
@zhangyanzhao
Collaborator

Added the YANG label since this PR has YANG-related changes.

@@ -0,0 +1,691 @@
#!/usr/bin/python3

from datetime import datetime
Collaborator

This needs to be packaged into a proper Python package, and it needs unit tests to validate the service.

spl_srv_list= ['database-chassis', 'gbsyncd']
core_srv_list = [
'swss.service',
'bgp.service',
Collaborator

I am not sure bgp.service is a core service; we need to check how this interacts with the application extension framework.

Collaborator

We will probably need to define whether a docker is a core or not in the docker's manifest.

import argparse
import multiprocessing as mp
from swsssdk import ConfigDBConnector
from swsssdk import SonicV2Connector
Collaborator

swsssdk is on the deprecation path; it cannot be used any more.


def db_connect():
try:
st_db = SonicV2Connector()
Collaborator

We should all use the swsscommon library.
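A minimal sketch of what the suggested switch might look like, assuming the `swsscommon.swsscommon` Python binding exposes a `SonicV2Connector` with the same `connect` and `get_all` calls as the swsssdk class used in the PR. The `FEATURE` table name for the STATE_DB key and both helper functions are assumptions made for illustration; this has not been checked against the PR or run on a SONiC device.

```python
def feature_key(name: str, table: str = "FEATURE", sep: str = "|") -> str:
    """Build a STATE_DB key such as 'FEATURE|bgp' (the table name is an assumption)."""
    return f"{table}{sep}{name}"


def read_up_status(name: str) -> str:
    """Read a feature's up_status from STATE_DB via swsscommon (needs a SONiC host)."""
    # Imported lazily so this module still loads where swsscommon is not installed.
    from swsscommon.swsscommon import SonicV2Connector

    db = SonicV2Connector()
    db.connect(db.STATE_DB)
    entry = dict(db.get_all(db.STATE_DB, feature_key(name)))
    # A missing field is treated as "false" here; that default is an assumption.
    return entry.get("up_status", "false")
```

Only the import line differs from the swsssdk-based `db_connect()` quoted above; the rest of the calling code would stay largely the same under these assumptions.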

}
}
}
container HOST_FEATURE {
Collaborator

I am not sure we should introduce such a YANG model. Why not add a flag in the feature table to indicate whether it is a host or a container feature?

@@ -51,7 +51,27 @@
{%- if feature in ["lldp", "pmon", "radv", "snmp", "telemetry"] %}
"set_owner": "kube", {% else %}
"set_owner": "local", {% endif %} {% endif %}
"high_mem_alert": "disabled"
"high_mem_alert": "disabled",
{%- if feature in ["bgp", "swss", "pmon", "nat", "teamd", "dhcp_relay", "sflow", "l2mcd", "udld", "stp", "snmp", "lldp", "radv", "iccpd", "syncd", "vrrp", "mgmt-framework", "tam"] %}
Collaborator

dhcp_relay is now an application extension package. We should not hardcode it, as we may have images with dhcp_relay and images without it.
We need to add this new field here - https://github.com/Azure/sonic-utilities/blob/master/sonic_package_manager/service_creator/feature.py#L34.
And we need to add a new manifest variable to control whether "check_up_status" should be true or false, which will also indicate whether the docker implements marking the up_status flag in STATE_DB.
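One way to picture this proposal: the feature entry is derived from the package manifest instead of a hardcoded list of docker names. The manifest key `check-up-status`, its placement under `service`, and the `feature_entry_from_manifest` helper are all hypothetical; the thread only proposes that such a variable should exist, and the real `feature.py` may be structured differently.

```python
def feature_entry_from_manifest(manifest: dict) -> dict:
    """Build feature table fields for a package from its manifest.

    The "check-up-status" manifest key is hypothetical; it stands in for the
    new manifest variable proposed in the review comment.
    """
    service = manifest.get("service", {})
    publishes_up_status = bool(service.get("check-up-status", False))
    return {
        "high_mem_alert": "disabled",
        # "true" only for dockers that mark up_status in STATE_DB themselves.
        "check_up_status": "true" if publishes_up_status else "false",
    }


dhcp_relay_entry = feature_entry_from_manifest({"service": {"name": "dhcp_relay"}})
```

A package that does not declare the variable gets `check_up_status` set to "false", so images built with or without dhcp_relay would need no template change.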

