Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Mellanox] Add SDK dump with new SAI implementation #8

Merged
merged 152 commits into from
May 11, 2021

Conversation

DavidZagury
Copy link
Owner

@DavidZagury DavidZagury commented Mar 29, 2021

Why I did it

To create SDK dump on Mellanox devices when SDK event has occurred.

How I did it

Update SDK and SAI to a new version that supports this feature.
Set the SKUs keys needed to initialize the feature in SAI.

How to verify it

Simulate SDK event and check that dump is created in the expected path.

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012

Description for the changelog

A picture of a cute animal (not mandatory but encouraged)

mssonicbld and others added 3 commits March 29, 2021 08:07
Co-authored-by: mssonicbld <vsts@fv-az80-884.nqsemdo0cabejmrqkclmmohwag.dx.internal.cloudapp.net>
To add latest SAI drop REL_4.3.3.3 to SONIC which addresses the following CSP cases:

CS00012058054: [4.3][IPinIP][TTL-PIPE] IPinIP TTL Pipe Mode is NOT working it is behaving UNIFORM mode even programed as PIPE mode
CS00011227466: [4.3] Warmboot support with tunnel encap
)

Fix the following issues:

Spectrum-2, Spectrum-3 | Port | Fix link issue when using 25 GbE rate between two ports while one is on Spectrum-2-based system and the other is on Spectrum-3-based system
All | warmboot | fail to upgrade from earlier SONiC versions with official SDK/FW 4.4.2306 (was on SONiC 201911)
All | What-Just-Happened | When enabling or disabling WJH under high traffic load to the host CPU, in very specific and low probability conditions, an error could occur, that may result in loss of data, channel failure or in extreme cases SW failure

Signed-off-by: Volodymyr Samotiy <[email protected]>
Copy link

@shlomibitton shlomibitton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the correct path is "/var/log/mellanox/sdk-dumps" with a lower 'm'.
Did you generate an image with this commit? need to verify there are no degradations with new SDK/FW/SAI first.

@DavidZagury DavidZagury force-pushed the 202012-sdk-dumps branch 3 times, most recently from e416442 to 32edf76 Compare March 30, 2021 14:31
antony-rheneus and others added 19 commits March 31, 2021 01:00
Build Marvell kernel driver for prestera sai sdk
Builds interrupt and dma kernel driver
Removed the older method pre-compiled kernel module debian package and its makefile
)

The file device/mellanox/x86_64-mlnx_msn4410-r0/plugins/sfputil.py is not a software link for device/mellanox/x86_64-mlnx_msn2700-r0/plugins/sfputil.py. And it is still using python2 syntex which causes some SFP CLI error. The PR is to change it to a softlink and add 4410 support in device/mellanox/x86_64-mlnx_msn2700-r0/plugins/sfputil.py.
…7122)

Bumps [lxml](https://github.com/lxml/lxml) from 4.6.2 to 4.6.3.
- [Release notes](https://github.com/lxml/lxml/releases)
- [Changelog](https://github.com/lxml/lxml/blob/master/CHANGES.txt)
- [Commits](lxml/lxml@lxml-4.6.2...lxml-4.6.3)

Signed-off-by: dependabot[bot] <[email protected]>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Update unsupported SAI attr ('SAI_ACL_TABLE_ATTR_FIELD_OUTER_VLAN_ID') to fix issues on acl table create
this PR updates the following commits in sonic-platform-daemons
260cf2d [xcvrd] change firmware information fields name inside MUX_CABLE_INFO table for Y cable (sonic-net#165)
cfa600f [thermalctld] Initialize fan led in thermalctld for the first run (sonic-net#167)
8509f43 [thermalctld] Refactor to allow for greater unit test coverage; Add more unit tests (sonic-net#157)
70f4e7b [syseepromd] Update warning message to be more informative (sonic-net#160)

Signed-off-by: vaibhav-dahiya <[email protected]>
Integrate hw-management package V.7.0010.2002

Bug fixes:
Removing critical thermal zones to prevent unexpected software system shutdown:
*Kernel 4.9 -0071-mlxsw-core-Remove-critical-trip-point-from-thermal-z.patch
*Kernel 4.19 -076-mlxsw-core-Remove-critical-trip-point-from-thermal-z.patch
Removing redundant link for cpld3 for fixed systems (SN2100, SN2010).
Fix an issue with missed attribute for cpld3 (port CPLD) for SN2700, SN2410.

Signed-off-by: Stephen Sun <[email protected]>
…sonic-net#7164)

The psample module was not loaded on barefoot platform. The loading of this module is a prerequisite for testing SFlow.

* add `.gitignore` to the `barefoot` subdirectory to overwrite ignore "platform/**/debian/*" in the root directory
The default bgp connect retry timer is 120 seconds. A reconnection will happen 120 seconds if the initial connection fails. This PR aims to allow a more frequent retry.
…Metadata section of minigraph file (sonic-net#7166)

Backport of sonic-net#7031 to the 202012 branch

#### Why I did it

To enable parsing the `AutoNegotiation` element from the LinkMetadata section of minigraph file

#### How I did it

Parse the value `AutoNegotiation` element from the `LinkMetadata` section of minigraph file. If the element is present, an `autoneg` key will be added to the port in the `PORT` table of Config DB with a value of either `0` or `1`

If an `autoneg` value is present in port_config.ini, the value from the minigraph will take precedence, overriding that value.

Also remove `AutoNegotiation` and `EnableAutoNegotiation` elements from the `DeviceInfo` section, as we will use this data in the `LinkMetadata` section to determine whether to enable auto-negotiation for a port.
Temporary skip psud for Newport, for Barefoot needs.

Signed-off-by: Volodymyr Boyko <[email protected]>
…et#7153)

Signed-off-by: Yong Zhao [email protected]

Why I did it
If device reboot was caused by kernel panic, then we need retrieve and store the key information into the symbol file previous-reboot-cause.json. The CLI show reboot-cause will read this file to get the reason of previous reboot.

This PR is related to PR in sonic-utilities repo: sonic-net/sonic-utilities#1486

How I did it
The string variable previous_reboot_cause will be parsed to check whether it contains the keyword Kernel Panic. If it did, then store the keyword and time information into a dictionary.

How to verify it
I verified this change on a virtual testbed.

admin@vlab-01:/host/reboot-cause$ more previous-reboot-cause.json
{"gen_time": "2021_03_24_23_22_35", "cause": "Kernel Panic", "user": "N/A", "time": "Wed 24 Mar 2021 11:22:03 PM UTC", "comment": "N/A"}

admin@vlab-01:/host/reboot-cause$ show reboot-cause
Kernel Panic [Time: Wed 24 Mar 2021 11:22:03 PM UTC]
c5be3ca [psud] Increase unit test coverage; Refactor mock platform (sonic-net#154)
450b7d7 Bug fix: the fields that are not supported by vendor should be "N/A" in STATE_DB (sonic-net#168)

Signed-off-by: Stephen Sun <[email protected]>
…_startup.py (sonic-net#7154)

To improve management of docker-gbsyncd-vs. gbsyncd_startup.py simply spawned syncd processes and then exited. In that case, supervisord would no longer manage any processes in the container, and thus there was no way to know if a critical process had exited.

I recently created gbsyncdmgrd to be a more complete, robust replacement for gbsyncd_startup.py.

NOTE: This PR is dependent on the inclusion of gbsyncdmgrd in the sonic-sairedis repo. A submodule update is pending at
sonic-net#7089
…exit listener; Set all event buffer sizes to 1024 (sonic-net#7203)

#### Why I did it

Backport of sonic-net#7083 to the 202012 branch.

To prevent error [messages](https://dev.azure.com/mssonic/build/_build/results?buildId=2254&view=logs&j=9a13fbcd-e92d-583c-2f89-d81f90cac1fd&t=739db6ba-1b35-5485-5697-de102068d650&l=802) like the following from being logged:

```
Mar 17 02:33:48.523153 vlab-01 INFO swss#supervisord 2021-03-17 02:33:48,518 ERRO pool supervisor-proc-exit-listener event buffer overflowed, discarding event 46
```

This is basically an addendum to sonic-net#5247, which increased the event buffer size for dependent-startup. While supervisor-proc-exit-listener doesn't subscribe to as many events as dependent-startup, there is still a chance some containers (like swss, as in the example above) have enough processes running to cause an overflow of the default buffer size of 10.

This is especially important for preventing erroneous log_analyzer failures in the sonic-mgmt repo regression tests, which have started occasionally causing PR check builds to fail. Example [here](https://dev.azure.com/mssonic/build/_build/results?buildId=2254&view=logs&j=9a13fbcd-e92d-583c-2f89-d81f90cac1fd&t=739db6ba-1b35-5485-5697-de102068d650&l=802).

I set all supervisor-proc-exit-listener event buffer sizes to 1024, and also updated all dependent-startup event buffer sizes to 1024, as well, to keep things simple, unified, and allow headroom so that we will not need to adjust these values frequently, if at all.
…tilites submodules (sonic-net#7209)

sonic-swss
-[SFlowMgr] Sflow Crash on 200G ports handled (sonic-net#1683)
-Stablize the test case (sonic-net#1679)
-Remove PGs from an administratively down port. (sonic-net#1677)

sonic-swss-common
- fix getting hash from redis db (sonic-net#465)
- [dbconnector] Initialize redisContext (sonic-net#464)

sonic-utilities
- route_check: Fix hanging & logging level (sonic-net#1520)
- Add self timeout and crash if exceeded. (sonic-net#1502)
- [reboot] User-friendly reboot cause message for kernel panic (sonic-net#1486)
- [acl-loader]: do not add default deny rule for egress acl (sonic-net#1531)

Signed-off-by: Danny Allen <[email protected]>
stephenxs and others added 22 commits May 6, 2021 19:37
68ea9efc Add pg-drop script to sonic filesystem (sonic-net#1583)
b216bf0a Fixing serial number read to get from DB if it is populated (sonic-net#1580)
fa7230c6 Handle the new db version which mellanox_buffer_migrator isn't interested (sonic-net#1566)

Signed-off-by: Stephen Sun <[email protected]>
…et#7543)

#### Why I did it

- PSU data is loaded into state DB.
   Following errors are seen in syslogs:
  "Failed to update PSU data - '<=' not supported between instances of 'float' and 'str'"

- Issue is not seen in master image as the PSU API return type is different.

#### How I did it

- Changed the return type in PSU API's.
…t#7547)

* [202012] Add SOC property to enable AN/LT on some platforms

Why I did it
To enable autonegotiation/link training on some Broadcom-based platforms (Arista 7060CX, 7260CX3, 7050cx3, Celestica DX010)

How I did it
Add appropriate SOC property for enabling the feature to the Broadcom config files of appropriate platforms
Also convert line endings to UNIX format for one Celestica file

* Add 'phy_an_lt_msft' to BCM config file permitted list
cleanup the build commands after build finished.
The default value is 600 minutes, it is not enough when building multiple images for a platform, change to 720 minutes.
…rm device facts. (sonic-net#7496)

- Why I did it
Current platform.json lacks some peripheral device related facts, like chassis/fan/pasu/drawer/thermal/components names, numbers, etc.

- How I did it
Add platform device facts to the platform.json file

- How to verify it
Run sonic-mgmt platform API tests which depend on these facts.

Signed-off-by: Kebo Liu <[email protected]>
…7535)

#### Why I did it

MSN4700 A1/A0 used different sensor chip but keep the existing platform name *x86_64-mlnx_msn4700-r0*, this is a workaround to replace the sensor conf on MSN4700 A1/A0

#### How I did it

Use a shell script to get the sensor conf path and copy that files to /etc/sensors.d/sensors.conf
…nic-net#7563)

* [202012][swss/swss-common/utilities/platform-daemons] Update submodule

sonic-swss
- [flex-counters] Delay flex counters stats init for faster boot time [202012] (sonic-net#1736)

sonic-swss-common
- [swig] allow threads (sonic-net#477)

sonic-utilities
- [sfpshow] Gracefully handle improper 'specification_compliance' field (sonic-net#1594)

sonic-platform-daemons
- [xcvrd] Change the y_cable presence logic to use "mux_cable" table as identifier from Config DB (sonic-net#176)
- [xcvrd] Enhance Media Settings (sonic-net#177)

Signed-off-by: Danny Allen <[email protected]>
Changed DellEMC Z9932f media settings from Vendor Name + PN method to common method.
rules/config.user allows overriding default properties without
touching tracked files. This change makes sure all properties
can be set and not just the ones used in slave.mk.

Signed-off-by: Christian Svensson <[email protected]>
Why I did it
After PR sonic-net#7344, 'make init' and/or 'make reset' will also build sonic slave dockers.

'-include rules/config.user' is supposed to be fine when the file is missing. However, when the file is missing, it generates a delayed error which later causes make init and make reset trying to build the sonic slave dockers.

How I did it
Define a do-nothing target for config.user to catch config.user build therefore preventing other builds to be triggered unexpectedly.

How to verify it
did make init and it is now only doing submodule init.
…7475)

Previously, a brief sleep was necessary in order to get Python threads to progress. The root cause of this has since been found and fixed in sonic-swss-common: sonic-net/sonic-swss-common#477. The submodule was updated here, so we can now safely remove this sleep.

This PR should also be cherry-picked to the 202012 branch once the submodule is updated there to also include the fix.
…c-net#7474)

Why I did it
Finding running containers through "docker ps" breaks when kubernetes deploys container, as the names are mangled.

How I did it
The data is is available from FEATURE table, which takes care of kubernetes deployment too.

How to verify it
Deploy a feature via kubernetes and don't expect error from container_check.
…sing files or sockets (sonic-net#7509)

fuser support is required since new cisco hardware watchdog plugin uses them to check anyone else use's /dev/watchdogX resource. The actual validation happens in the platform code, but the package is required for pmon container. Currently the /dev/watchdogX is being used by cisco platform-monitor service. Cisco chassis level watchdog plugin uses "fuser" to claim the watchdog release from platform-monitor service.
LED_PROC_INIT_SOC variable was incorrectly referenced as LED_SOC_INIT_SOC. Introduced in sonic-net#5483

Rather than fixing the typo, I decided to simplify the script, removing the need for the conditional altogether by moving the bcmcmd call inside the conditional which checks for the presence of LED_SOC_INIT_SOC.
https://github.com/mbj4668/pyang/blob/master/pyang/repository.py#L93 throws an exception with pip 21.1

add ietf yang model explicitly to the build process fix the test failure.

tests/test_sonic_yang_models.py .F [ 66%]
tests/yang_model_tests/test_yang_model.py . [100%]
Failed: pyang -f tree ./yang-models/*.yang > ./yang-models/sonic_yang_tree
----------------------------- Captured stderr call -----------------------------
./yang-models/sonic-acl.yang:8: error: module "ietf-inet-types" not found in search path
./yang-models/sonic-device_metadata.yang:8: error: module "ietf-yang-types" not found in search path

Signed-off-by: Guohan Lu <[email protected]>
@DavidZagury DavidZagury merged commit a10ce57 into 202012 May 11, 2021
DavidZagury pushed a commit that referenced this pull request Dec 8, 2021
Update Barefoot platform support for Bullseye and 5.10 kernel, and add
python3-venv.
@DavidZagury DavidZagury deleted the 202012-sdk-dumps branch December 8, 2021 09:04
DavidZagury pushed a commit that referenced this pull request Jan 12, 2022
#### What I did 

[sonic-linkmgrd][master] submodule update

6c6151b Fix unstable unit tests (state change handler wasn't invoked) (#8)
2f7dc0a support code diff coverage (#5)
83f0002 Force mux state switch to standby if triggered from Cli (#6)

signed-off-by: Jing Zhang [email protected]
DavidZagury pushed a commit that referenced this pull request Feb 27, 2022
* [BFN] Updated platform APIs impl

Signed-off-by: Andriy Kokhan <[email protected]>

* Extended BFN platform SFP APIs implementation

* Update sfp.py

* [BFN] Extended SFP platform plugin implementation

Signed-off-by: Andriy Kokhan <[email protected]>

* [BFN] Extended Fans platform plugin implementation

* [BFN] divided classes Fan and  FanDrawer into 2 files

* Signed-off-by: Vadym Yashchenko <[email protected]>

What I did
	Add get_model() function
	Add get_low_critical_threshold() function
	Change __get(...) function.
How I did it
	Differnece from previous implementation of __get(...) function is return real value or -9999.9 if value is not provided by thrift API

* Add get_presence() function and revised __get() function

Signed-off-by: Vadym Yashchenko <[email protected]>

* [BFN] Updated PSU platform APIs impl

Signed-off-by: Dmytro Lytvynenko <[email protected]>

* Added BFN PSU cache (#9)

Signed-off-by: Andriy Kokhan <[email protected]>

* [BFN]  Fans and Fantray platform APIs update (#7)

* [BFN] Updated SFP platform APIs (#10)

Signed-off-by: Volodymyr Boyko <[email protected]>

* [BFN] Updated platform API for thermal (#8)

* Signed-off-by: Vadym Yashchenko <[email protected]>

* Revert "[BFN]  Fans and Fantray platform APIs update (#7)" (#11)

This reverts commit c62a733.

* Add support health monitor system (#15)

Signed-off-by: Petro Bratash <[email protected]>

* Update chassis.py

* [BFN] Updated FANs and FAN Tray platform API (#14)

* Fix fix_alignment (#17)

Signed-off-by: Petro Bratash <[email protected]>

* [BFN] Improvement show environment (#16)

* Added PSU temperature skip into platform.json (#18)

Signed-off-by: Andriy Kokhan <[email protected]>

* Do not skip psud on Newport

Signed-off-by: Andriy Kokhan <[email protected]>

* [BFN] fix fan status from Not OK to Ok (#19)

* [BFN] Updated SFP platform plugin (#13)

Signed-off-by: Volodymyr Boyko <[email protected]>

* [DPB] Fix typo for Ethernet0 2x200G[100G,40G] breakout mode (#21)

Signed-off-by: Mykola Gerasymenko <[email protected]>

* [barefoot] Tmp fix vendor_rev (#22)

Signed-off-by: Volodymyr Boyko <[email protected]>

* Fixed python issues in sonic_platform/fan_drawer.py

Signed-off-by: Andriy Kokhan <[email protected]>

* Updated fan_drawer.py

* Fixing trailing white spaces in fan_drawer.py

* [BFN] Fix thrift for SFPs API

Signed-off-by: Volodymyr Boyko <[email protected]>

* In platform.json, replaced 'false' with '0' to workaround ast.literal_eval() issue

Signed-off-by: Andriy Kokhan <[email protected]>

* [Newport] Thermal manager  (#23)

* Signed-off-by: Vadym Yashchenko <[email protected]>

* Revert "In platform.json, replaced 'false' with '0' to workaround ast.literal_eval() issue"

This reverts commit 1e73127.

* Removed 'controllable' options from platform.json to fix factory default config generation

Signed-off-by: Andriy Kokhan <[email protected]>

* Update thermal_manager.py

* Migrated SFP plugin to sonic_xcvr API (#30)

Signed-off-by: Andriy Kokhan <[email protected]>

Co-authored-by: KostiantynYarovyiBf <[email protected]>
Co-authored-by: Vadym Yashchenko <[email protected]>
Co-authored-by: Dmytro Lytvynenko <[email protected]>
Co-authored-by: Volodymyr Boiko <[email protected]>
Co-authored-by: Petro Bratash <[email protected]>
Co-authored-by: Mykola Gerasymenko <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.