[SmartSwitch] Add a new API for the DPU chassis to query dataplane and midplane states #507

oleksandrivantsiv
Description

Add a definition for a new DPU chassis API required for querying DPU dataplane and midplane states.

Motivation and Context

A new API is required to enable querying of the DPU dataplane and midplane states. These states will be monitored by the chassisd service running on the DPU and pushed to the CHASSIS_STATE_DB upon any changes. This will allow the NPU to subscribe to DPU state changes.
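A minimal sketch of the flow described above, under stated assumptions. The method names `get_dataplane_state` and `get_controlplane_state` are taken from the diff under review; everything else (`ChassisBaseSketch`, `FakeDpuChassis`, `InMemoryStateTable`, `publish_on_change`, the boolean return values, and the field names) is hypothetical and stands in for the real chassis class, chassisd, and CHASSIS_STATE_DB.

```python
class ChassisBaseSketch:
    """Hypothetical stand-in for the platform chassis base class."""

    def get_dataplane_state(self):
        raise NotImplementedError

    def get_controlplane_state(self):
        raise NotImplementedError


class FakeDpuChassis(ChassisBaseSketch):
    """Vendor-style override returning canned values (assumed to be booleans)."""

    def __init__(self):
        self.dataplane_up = True
        self.controlplane_up = True

    def get_dataplane_state(self):
        return self.dataplane_up

    def get_controlplane_state(self):
        return self.controlplane_up


class InMemoryStateTable:
    """Dict-backed stand-in for a CHASSIS_STATE_DB table."""

    def __init__(self):
        self.rows = {}
        self.writes = 0

    def set(self, key, fields):
        self.rows[key] = dict(fields)
        self.writes += 1


def publish_on_change(chassis, table, key, last):
    """Write the current states to the table only if they differ from `last`."""
    current = {
        "dataplane_state": "up" if chassis.get_dataplane_state() else "down",
        "controlplane_state": "up" if chassis.get_controlplane_state() else "down",
    }
    if current != last:
        table.set(key, current)
    return current
```

A daemon loop would call `publish_on_change` periodically and carry the returned dictionary into the next call, so that unchanged states do not generate writes for subscribers.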

How Has This Been Tested?

The API tests will be added as part of the chassisd changes.

Additional Information (Optional)

jleveque and others added 30 commits October 28, 2020 12:39
…ies in setup.py (sonic-net#106)

Remove dependence on the 'enum' package, as we are currently transitioning from Python 2 to Python 3 and there are installation conflict issues between the `enum` package and the `enum34` package.

Add 'sonic-py-common' as a dependency in setup.py for xcvrd, and add spaces around "equals" signs.
…onfig (sonic-net#108)

Add a check to ensure that initializeGlobalConfig() is invoked only on multi-ASIC platforms.

Additionally, remove the initializeGlobalConfig() call in the DomUpdate thread and the SFPUpdate process. initializeGlobalConfig() is already invoked in the parent xcvrd daemon, and the resulting state is available to the child thread/process.
…d on Python version (sonic-net#107)

Add the dependence on the 'enum' package back to xcvrd (basically reverting most of sonic-net/sonic-platform-daemons#106). However, setup.py now installs the enum34 package only if the Python version being installed for is < 3.4. Thus, when installing the Python 2 xcvrd package on Python 2.7, enum34 will be installed. When installing the Python 3 xcvrd package on Python 3.7, enum34 will not be installed, so xcvrd imports the 'enum' module from the standard library. This prevents the conflicts that arise when 'enum34' is installed on Python versions >= 3.4.
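The version check can be sketched as follows. This is an illustration only, not the actual setup.py from the commit; the helper name `build_install_requires` is hypothetical, and 'sonic-py-common' is used as the base dependency because the previous commit message mentions it.

```python
import sys


def build_install_requires(version_info=sys.version_info):
    """Return the dependency list, adding enum34 only for interpreters older than 3.4."""
    requires = ["sonic-py-common"]
    # The 'enum' module joined the standard library in Python 3.4, so the
    # enum34 backport is only needed (and only safe) below that version.
    if tuple(version_info[:2]) < (3, 4):
        requires.append("enum34")
    return requires
```

The same intent can also be expressed declaratively with an environment marker in the requirement string, for example `'enum34; python_version < "3.4"'`, which setuptools and pip evaluate at install time.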
…d status updates with xcvrd. (sonic-net#105)

* [xcvrd] support for integrating y cable within xcvrd

This PR provides the infrastructure needed to initialize Y cable ports inside SONiC, with xcvrd as the platform daemon.
The integration has two parts:

While xcvrd initializes, it checks config_db for Y cable presence by looking for the mux_cable identifier as a key among the key-value pairs. Once a Y cable is found to be attached to a port, State DB is updated with the corresponding data for that Y cable port.

Once initialization is done and Y cable presence is established, one thread periodically monitors the APPL DB MUX_CABLE_COMMAND table for updates, and another periodically checks for change events. If an update is found, the corresponding changes are applied to the mux using the
sonic_y_cable package and reflected in STATE_DB.
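The initialization step above can be sketched roughly like this. Plain dictionaries stand in for config_db and STATE_DB, and the function names, port names, and the `state` field are hypothetical; the only detail taken from the commit message is that a port is treated as a Y cable port when `mux_cable` appears as a key.

```python
def find_y_cable_ports(port_config):
    """Return the sorted port names whose config fields contain a 'mux_cable' key."""
    return sorted(
        port for port, fields in port_config.items() if "mux_cable" in fields
    )


def init_state_db(port_config, state_db):
    """Create a state entry for every detected Y cable port."""
    for port in find_y_cable_ports(port_config):
        # Placeholder value; the real daemon stores data read from the cable.
        state_db[port] = {"state": "unknown"}
    return state_db
```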

What is the motivation for this PR?
To add the necessary infrastructure for Credo Y cable integration within SONiC.

How did you do it?
Added the necessary changes and a new xcvrd_utilities subdirectory for the y_cable utility code.
Reorganized setup.py and the sonic-xcvrd code into this layout:

sonic-xcvrd/setup.py
sonic-xcvrd/src/__init__.py
sonic-xcvrd/scripts/xcvrd → sonic-xcvrd/src/xcvrd.py
sonic-xcvrd/src/xcvrd_utilities/__init__.py
sonic-xcvrd/src/xcvrd_utilities/y_cable_helper.py

Signed-off-by: vaibhav-dahiya <[email protected]>
Introducing chassisd to monitor status of cards on a modular chassis

HLD: sonic-net/SONiC#646

**-What I did**
Introducing a new process to monitor status of control, line and fabric cards.

**-How I did it**
Supports monitoring of line cards and fabric cards. This runs periodically in the main thread.
It updates the STATE_DB with the status information. 'show platform chassis-modules' will read from the STATE_DB.

Supports configuration that moves cards to the administratively up/down state. This is handled in
a separate thread that waits on select() for config events from the CHASSIS_MODULE table in CONFIG_DB.
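The split between a periodic main loop and a blocking configuration thread can be illustrated with the standard library alone. This is a sketch under assumptions: a `queue.Queue` stands in for select() on the CHASSIS_MODULE table, and the function names and module names are hypothetical.

```python
import queue
import threading


def config_worker(events, admin_state, stop_token=None):
    """Block on the event queue and apply admin up/down changes until stopped."""
    while True:
        event = events.get()  # blocks, much like waiting in select()
        if event is stop_token:
            break
        module_name, state = event
        admin_state[module_name] = state


def run_example():
    """Feed two config events to the worker thread and return the resulting state."""
    events = queue.Queue()
    admin_state = {}
    worker = threading.Thread(target=config_worker, args=(events, admin_state))
    worker.start()
    events.put(("LINE-CARD0", "down"))
    events.put(("FABRIC-CARD1", "up"))
    events.put(None)  # stop token
    worker.join()
    return admin_state
```

Keeping the configuration handler on its own thread means an admin state change is applied as soon as it arrives, without waiting for the next periodic status poll.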
PSUd changes to compute power-budget for Modular chassis

HLD: sonic-net/SONiC#646

PSUd will introduce power requirement calculations. Platform APIs are introduced to report the power consumers and the total consumed power. The number of PSUs is used to determine the total supplied power.

**Output of STATE-DB:**
```
  "CHASSIS_INFO|chassis_power_budget 1": {
    "expireat": 1603182970.639244,
    "ttl": -0.001,
    "type": "hash",
    "value": {
      "SUPERVISOR consumed_power": "80.0",
      "FABRIC-CARD consumed_power": "185.0",
      "FAN consumed_power": "999",
      "LINE-CARD consumed_power": "1000.0",
      "PSU supplied_power": "9000.0"
    }
  },
```
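As a worked example, the arithmetic implied by the sample entry above can be sketched as follows. The function name and the field-suffix convention are assumptions drawn from the sample output; this is not the calculation code from psud.

```python
def power_budget(fields):
    """Return (supplied, consumed, remaining) watts from a chassis_power_budget entry."""
    supplied = 0.0
    consumed = 0.0
    for name, value in fields.items():
        if name.endswith("supplied_power"):
            supplied += float(value)
        elif name.endswith("consumed_power"):
            consumed += float(value)
    return supplied, consumed, supplied - consumed


# Values copied from the STATE-DB sample output.
sample = {
    "SUPERVISOR consumed_power": "80.0",
    "FABRIC-CARD consumed_power": "185.0",
    "FAN consumed_power": "999",
    "LINE-CARD consumed_power": "1000.0",
    "PSU supplied_power": "9000.0",
}
```

For the sample values, the consumers add up to 2264.0 W against 9000.0 W supplied, leaving 6736.0 W of headroom.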
Enhance thermalctld to write to chassis state-DB on a modular chassis

HLD: sonic-net/SONiC#646

In a modular chassis, the thermal information from all line-cards
will be updated to the chassis state-DB in the control-card.

Additionally, minimum and maximum temperatures will be recorded.
The fan control algorithm used by certain vendors will require
this information.
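A minimal sketch of recording minimum and maximum temperatures, assuming readings arrive one at a time. The class name `MinMaxRecorder` is hypothetical and is not the thermalctld or platform API.

```python
class MinMaxRecorder:
    """Track the lowest and highest temperature seen so far."""

    def __init__(self):
        self.minimum = None
        self.maximum = None

    def record(self, temperature):
        # The first reading initializes both bounds; later ones widen them.
        if self.minimum is None or temperature < self.minimum:
            self.minimum = temperature
        if self.maximum is None or temperature > self.maximum:
            self.maximum = temperature
```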
Added changes in the sonic_xcvrd directory of sonic-platform-daemons, changed src dir to xcvrd dir for package generation and changed the setup.py to include the package xcvrd

Signed-off-by: vaibhav-dahiya <[email protected]>
…riable (sonic-net#112)

Previously, chassisd and thermalctld assumed that the swsscommon library would not be installed in the unit testing environment. This is not a valid assumption, and would cause unit tests to fail if swsscommon was available in the unit test environment, because it would get imported, but there would be no Redis DB to communicate with.

This PR uses environment variables, which are set by the unit tests themselves, to determine whether to load the real or mock libraries. This solution is similar to what is done in sonic-utilities.
…tup_function() (sonic-net#114)

Since these tests are run via unittest infrastructure, and not via Pytest, `setup_function()` is not the proper location to set these variables.
…onic-net#117)

Previously, psud assumed that the swsscommon library would not be installed in the unit testing environment. This is not a valid assumption, and would cause unit tests to fail if swsscommon was available in the unit test environment, because it would get imported, but there would be no Redis DB to communicate with.

This PR uses environment variables, which are set by the unit tests themselves, to determine whether to load the real or mock libraries. This solution is similar to what is done in sonic-utilities.
…variable (sonic-net#120)

When the `PSUD_UNIT_TESTING` and `THERMALCTLD_UNIT_TESTING` environment variables are not set, the daemons fail at startup with the following errors:
```
 psud Traceback (most recent call last):
 psud File "/usr/local/bin/psud", line 21, in <module>
 psud if os.environ["PSUD_UNIT_TESTING"] == "1":
 psud File "/usr/lib/python2.7/UserDict.py", line 40, in __getitem__
 psud raise KeyError(key)
 psud KeyError: 'PSUD_UNIT_TESTING'
```

```
 thermalctld Traceback (most recent call last):
 thermalctld File "/usr/local/bin/thermalctld", line 19, in <module>
 thermalctld if os.environ["THERMALCTLD_UNIT_TESTING"] == "1":
 thermalctld File "/usr/lib/python2.7/UserDict.py", line 40, in __getitem__
 thermalctld raise KeyError(key)
 thermalctld KeyError: 'THERMALCTLD_UNIT_TESTING'
```

Also fixed the same issue in `chassisd`.
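One way to avoid the `KeyError` shown in the tracebacks is to read the variable with `dict.get`, which returns `None` when the variable is unset. The helper below is a sketch; the function name is hypothetical and the actual fix in the commit may be written differently.

```python
import os


def unit_testing_enabled(var_name, environ=None):
    """Return True only if the variable exists and is set to "1"."""
    environ = os.environ if environ is None else environ
    # .get() returns None for a missing key instead of raising KeyError.
    return environ.get(var_name) == "1"
```

With this check, a daemon started outside the test harness simply takes the production import path.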

Signed-off-by: Petro Bratash <[email protected]>
…or physical entity mib (sonic-net#102)

* Update pmon daemons for SONiC Physical Entity MIB feature
Fixes the following crash introduced by sonic-net/sonic-platform-daemons#102:

```
01:33:00  ______________________ test_updater_thermal_check_min_max ______________________
01:33:00  
01:33:00      def test_updater_thermal_check_min_max():
01:33:00          chassis = MockChassis()
01:33:00      
01:33:00          thermal = MockThermal()
01:33:00          chassis.get_all_thermals().append(thermal)
01:33:00      
01:33:00          chassis.set_modular_chassis(True)
01:33:00          chassis.set_my_slot(1)
01:33:00          temperature_updater = TemperatureUpdater(SYSLOG_IDENTIFIER, chassis)
01:33:00      
01:33:00          temperature_updater.update()
01:33:00          slot_dict = temperature_updater.chassis_table.get('Thermal 1')
01:33:00  >       assert slot_dict['minimum_temperature'] == str(thermal.get_minimum_recorded())
01:33:00  E       TypeError: 'NoneType' object has no attribute '__getitem__'
01:33:00  
01:33:00  tests/test_thermalctld.py:341: TypeError
```

Signed-off-by: Petro Bratash <[email protected]>

Signed-off-by: Petro Bratash <[email protected]>
Without this change, LEDs were only set when an event happened.
Given that power supplies are assumed present by default, LEDs would never be set to `green`.
Instead, they would have been left in the state the platform initialization left them in (e.g. `off`).
…alizeGlobalConfig (sonic-net#130)

The check for multi-ASIC before calling initializeGlobalConfig was added to xcvrd earlier.
This change adds it to the other processes in sonic-platform-daemons as well.
…r conditions/events (sonic-net#129)

* [xcvrd] Fix y_cable state update to unknown on erroneous events

This PR changes the state DB updates from 'failure' to 'unknown' when an error event occurs while operating the Y cable.
What is the motivation for this PR?
The schema agreed upon for the interaction of linkmgr and orchagent with xcvrd is that, on an error event, xcvrd must fill the state DB with 'unknown' as the state value rather than 'failure'. This PR handles that.

How did you do it?
Identified the error scenarios in the code and made the changes.

Signed-off-by: vaibhav-dahiya <[email protected]>
…on (sonic-net#131)

Summary:
This PR replaces the logic that checks mux_direction on the y_cable: it now reads the mux_direction register instead of the actively linked and routing TOR register.
Approach
Added the changes in y_cable_helper.py by replacing the API.

What is the motivation for this PR?
check_mux_direction is required as per the design to replace active_linked_tor_side
active_linked_tor_side -> check_mux_direction
check_mux_direction will be utilized for establishing the mux direction explicitly

Signed-off-by: vaibhav-dahiya <[email protected]>
Updating for completeness on how mock objects need to be imported

```
mprabhu@565bc0455e84:/sonic/src/sonic-platform-daemons/sonic-psud$ python2 setup.py test
running pytest
running egg_info
writing sonic_psud.egg-info/PKG-INFO
writing top-level names to sonic_psud.egg-info/top_level.txt
writing dependency_links to sonic_psud.egg-info/dependency_links.txt
reading manifest file 'sonic_psud.egg-info/SOURCES.txt'
writing manifest file 'sonic_psud.egg-info/SOURCES.txt'
running build_ext
==================================================================================== test session starts =====================================================================================
platform linux2 -- Python 2.7.16, pytest-3.10.1, py-1.7.0, pluggy-0.8.0
rootdir: /sonic/src/sonic-platform-daemons/sonic-psud, inifile: pytest.ini
plugins: cov-2.6.0
collected 3 items

tests/test_psud.py ...                                                                                                                                                                 [100%]

---------- coverage: platform linux2, python 2.7.16-final-0 ----------
Name           Stmts   Miss  Cover
----------------------------------
scripts/psud     355    216    39%
Coverage HTML written to dir htmlcov
Coverage XML written to file coverage.xml


================================================================================== 3 passed in 0.16 seconds ==================================================================================
```
…t#132)

Python 2 is end of life and SONiC is going to support Python 3. This PR changes code in xcvrd, psud, thermalctld and syseeprom to make it compatible with both Python 3 and Python 2.
Align style with slightly modified PEP8 standards (extend maximum line length to 120 chars). This will also help in the transition to Python 3, which is stricter about whitespace.

Done using `autopep8 --in-place --max-line-length 120` and some manual tweaks.
…obe in mux cable driver (sonic-net#134)

Summary:
This PR removes the delete logic that ran after processing a command probe message received from linkmgr.

What is the motivation for this PR?
The delete tends to create an error scenario if many probe messages arrive and the redis API fails to retrieve the message contents.

Signed-off-by: vaibhav-dahiya <[email protected]>
…net#133)

Summary:
This PR provides the necessary infrastructure to add pytest support and integration in the sonic-xcvrd submodule.
This PR also adds unit tests for the xcvrd daemon.
What is the motivation for this PR?
To add pytest unit test support to sonic-platform-daemons and the sonic-xcvrd daemon, and to add some unit tests.
Signed-off-by: vaibhav-dahiya <[email protected]>
Enhance chassisd to monitor midplane status of the cards in modular chassis

HLD: sonic-net/SONiC#646

-What I did
Add monitoring of the midplane or internal ethernet network between supervisor and line-card modules.

-How I did it
Along with status monitoring, also monitor the midplane reachability between supervisor and modules.
It updates the STATE_DB with the status information. 'show chassis-modules midplane-status' will read from the STATE_DB
Why I did this?

xcvrd unit test failed when building it with python3: 

```
17:23:50  _____________________ ERROR collecting tests/test_xcvrd.py _____________________
17:23:50  tests/test_xcvrd.py:36: in <module>
17:23:50      class TestXcvrdScript(object):
17:23:50  tests/test_xcvrd.py:41: in TestXcvrdScript
17:23:50      @patch('xcvrd.xcvrd.logical_port_name_to_physical_port_list', MagicMock(return_value=[0]))
17:23:50  E   NameError: name 'patch' is not defined
```

How I did this?
Import `patch` from the mock package in the test module.
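A hedged sketch of an import that works on both interpreter lines, followed by a small use of `patch`. The `Ports` class and `lookup_with_patch` function are hypothetical and exist only to make the example self-contained; the actual test module patches functions in xcvrd.

```python
import sys

if sys.version_info >= (3, 3):
    # unittest.mock ships with the standard library from Python 3.3 onward.
    from unittest.mock import MagicMock, patch
else:
    # On Python 2 the third-party 'mock' backport provides the same names.
    from mock import MagicMock, patch


class Ports(object):
    """Hypothetical stand-in for a module that talks to real hardware."""

    @staticmethod
    def lookup(name):
        raise RuntimeError("hardware access is not available in unit tests")


def lookup_with_patch():
    """Call Ports.lookup while it is temporarily replaced by a mock."""
    with patch.object(Ports, "lookup", MagicMock(return_value=[0])):
        return Ports.lookup("Ethernet0")
```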
…-net#137)

- Initialize self.presence and other variables in `PsuStatus.__init__()` to False instead of True.
- Import datetime module.
- Discussions related to this issue can be seen in sonic-net/sonic-platform-daemons#136
- Add 100% unit test coverage of `PsuStatus` class in psud.
- Add skeleton of class to test `DaemonPsud` class
- Add test case for `get_psu_key()` and `try_get()` helper functions
- Add checks to import 'mock' from the 'unittest' package if running with Python 3

Overall psud unit test coverage increases from 39% to 51%.

Previous unit test coverage:

```
----------- coverage: platform linux, python 3.7.3-final-0 -----------
Name           Stmts   Miss  Cover
----------------------------------
scripts/psud     381    233    39%
Coverage HTML written to dir htmlcov
Coverage XML written to file coverage.xml
```

Unit test coverage with this patch:

```
----------- coverage: platform linux, python 3.7.3-final-0 -----------
Name           Stmts   Miss  Cover
----------------------------------
scripts/psud     381    185    51%
Coverage HTML written to dir htmlcov
Coverage XML written to file coverage.xml
```
Report Pytest unit test coverage for thermalctld.

Current coverage:

```
----------- coverage: platform linux, python 3.7.3-final-0 -----------
Name                  Stmts   Miss  Cover
-----------------------------------------
scripts/thermalctld     424    113    73%
Coverage HTML written to dir htmlcov
Coverage XML written to file coverage.xml
```

- Also add check to import 'mock' from the 'unittest' package if running with Python 3
- Refactor ledd:
    - Remove useless try/catch from around imports
    - Move argument parsing out of `DaemonLedd.run()` method and into `main()` function, a more appropriate location
    - Fix LGTM alert for unreachable code

- Add unit tests and report coverage:
    - Test passing good and bad command-line arguments to ledd process

Unit test coverage with this patch:
```
----------- coverage: platform linux, python 3.7.3-final-0 -----------
Name           Stmts   Miss  Cover
----------------------------------
scripts/ledd      66     34    48%
Coverage HTML written to dir htmlcov
Coverage XML written to file coverage.xml
```
stepanblyschak and others added 13 commits June 10, 2024 09:20
…onic-net#497)

* [CMIS] Skip re-init flow for SW-controlled ports in case of fastboot

Signed-off-by: vadymhlushko-mlnx <[email protected]>

* Change the log message

Signed-off-by: Stepan Blyschak <[email protected]>

---------

Signed-off-by: vadymhlushko-mlnx <[email protected]>
Signed-off-by: Stepan Blyschak <[email protected]>
Co-authored-by: vadymhlushko-mlnx <[email protected]>
…media_settings.json (sonic-net#471)

* [xcvrd] Modify to support regular expression when parsing the key in media_settings.json

* fix unit test error

* add unit test for getting media settings value with regular expression

* define get_media_settings()

* apply the suggestion for if condition
…able (sonic-net#511)

* Initialize application specific fields as 'N/A' in TRANSCEIVER_INFO table

Signed-off-by: Mihir Patel <[email protected]>

* Changed a debug log to warning

* Modified log_error to log_warning

* Added comment for updating DB after xcvrd restart

---------

Signed-off-by: Mihir Patel <[email protected]>
…ng swsscommon table within the context (sonic-net#509)

* [ycabled][active-active] Fix in gRPC channel callback logic by creating
swsscommon table within the context

Signed-off-by: Vaibhav Dahiya <[email protected]>

* fix UT

Signed-off-by: Vaibhav Dahiya <[email protected]>

* add more tests

Signed-off-by: Vaibhav Dahiya <[email protected]>

* typo

Signed-off-by: Vaibhav Dahiya <[email protected]>

* add port

Signed-off-by: Vaibhav Dahiya <[email protected]>

* add logging

Signed-off-by: Vaibhav Dahiya <[email protected]>

* add tests

Signed-off-by: Vaibhav Dahiya <[email protected]>

---------

Signed-off-by: Vaibhav Dahiya <[email protected]>
…ne log messages with physical slot number (#530)

* [chassis][pmon][chassid] Enhance the chassid module on-line or off-line with physical slot num

---------

Signed-off-by: mlok <[email protected]>
…(#529)

* [PMON][psud] Fix the repeated NOTICE log message on Chassis platform

Signed-off-by: mlok <[email protected]>

* Fix the Unit test

---------

Signed-off-by: mlok <[email protected]>
* [xcvrd] Add logs to improve debugging in xcvrd

Signed-off-by: Mihir Patel <[email protected]>

* Fixed unit-test failure

* Improved code coverage

* Changed warning to notice

---------

Signed-off-by: Mihir Patel <[email protected]>
…s (#533)

* Enhance media_settings_parser for 100G xcvr and DPB etc

* Revert space change

* Cover corner cases

* Change log message level

* Fix docstring and update name of get_speed_lane_count_and_subport

* Address comment

* Change to re.fullmatch for lane_speed key
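The difference that motivates `re.fullmatch` can be sketched as follows. The key formats and the `find_setting` helper are hypothetical and are not taken from media_settings.json or the parser; the point is only that `re.match` accepts a pattern that matches a prefix of the key, while `re.fullmatch` requires the whole key to match.

```python
import re


def find_setting(settings, lane_speed_key):
    """Return the value of the first pattern that matches the entire key."""
    for pattern, value in settings.items():
        if re.fullmatch(pattern, lane_speed_key):
            return value
    return None


# Hypothetical patterns: the first is a strict prefix of keys the second accepts.
settings = {
    "speed:100G": "short",
    "speed:100GAUI-.*": "long",
}
```

With `re.match`, the key `speed:100GAUI-2` would be claimed by the first pattern because it matches at the start of the string; `re.fullmatch` skips it and selects the second pattern.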
…ng custom NPU SI settings (#541)

* Xcvrd crash and restart should not cause link flap on platforms needing custom SI settings

Signed-off-by: Mihir Patel <[email protected]>

* Improved code coverage

---------

Signed-off-by: Mihir Patel <[email protected]>
@oleksandrivantsiv oleksandrivantsiv marked this pull request as draft October 22, 2024 21:34
@@ -280,26 +280,64 @@ def get_module_index(self, module_name):
# SmartSwitch methods
##############################################

- def get_dpu_id(self, name):
+ def get_dpu_id(self, **kwargs):

@oleksandrivantsiv since this is a base class, can we make use of get_module_index()?


get_module_index has a different meaning: it returns an integer, the index of the ModuleBase object in the module_list. get_dpu_id returns the physical ID of the DPU (its position).
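To illustrate the distinction, here is a small sketch in which the list index and the physical ID differ. `ModuleSketch`, `ChassisSketch`, the `dpu_id` attribute, and the -1 return for unknown names are all assumptions made for this example; the sketch uses the pre-change `get_dpu_id(self, name)` signature and is not the base class implementation.

```python
class ModuleSketch:
    def __init__(self, name, dpu_id):
        self.name = name
        self.dpu_id = dpu_id  # physical position of the DPU (assumed attribute)


class ChassisSketch:
    def __init__(self, modules):
        self._module_list = list(modules)

    def get_module_index(self, module_name):
        """Return the position of the named module in the module list, or -1."""
        for index, module in enumerate(self._module_list):
            if module.name == module_name:
                return index
        return -1

    def get_dpu_id(self, name):
        """Return the physical ID of the named DPU, or -1 if it is unknown."""
        for module in self._module_list:
            if module.name == name:
                return module.dpu_id
        return -1


# DPU3 is first in the list (index 0) but sits in physical position 3.
chassis = ChassisSketch([ModuleSketch("DPU3", 3), ModuleSketch("DPU1", 1)])
```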

Yes, get_dpu_id already exists; I changed its parameter from name to **kwargs. This API will have a different meaning for the switch and the DPU. Please check the description.

"""
return False

def get_dpu_dataplane_state(self):

@oleksandrivantsiv do we need to have dpu specified in the function name, or can we keep it generic as get_dataplane_state(), since this is an abstract base class?

"""
raise NotImplementedError

def get_dpu_controlplane_state(self):

@oleksandrivantsiv do we need to have dpu specified in the function name, or can we keep it generic as get_controlplane_state(), since this is an abstract base class?

@oleksandrivantsiv oleksandrivantsiv marked this pull request as ready for review October 23, 2024 17:13
@prgeor

prgeor commented Oct 24, 2024

@rameshraghupathy can you review?

"""
return False

def get_dataplane_state(self):

The get_state_info(self) API already covers this. Please refer to the HLD. This API appears to be redundant.


@rameshraghupathy we discussed this last week. This is needed for the DPU chassisd to populate the data plane and control plane states from the DPU to the CHASSIS_STATE_DB. This API is defined on the chassis level. What you are referring to is the module-level API that will run on the NPU side and has a completely different meaning.


@rameshraghupathy rameshraghupathy Oct 25, 2024


"""
raise NotImplementedError

def get_controlplane_state(self):

The get_state_info(self) API already covers this. Please refer to the HLD. This API appears to be redundant.

CLA Not Signed
