[Mellanox] Backport patch to remove critical trip point from thermal zones #201

Merged (1 commit) on Apr 1, 2021


@stephenxs stephenxs commented Mar 25, 2021

Backport a patch that removes the critical trip point from thermal zones on Mellanox devices.

  1. 0027-mlxsw-core-Remove-critical-trip-point-from-thermal-z.patch
    Disable software thermal protection by removing critical trip points
    for the all thermal zones.
    
    According to the system requirements software should never perform
    system thermal protection, since all the systems implement two levels
    of thermal protection: the first one is performed by firmware, the
    second, in case firmware was not able to perform protection, by
    hardware, while the temperature threshold for hardware protection is
    higher than for firmware.
    
    In both cases, when critical temperature is reached, system will be
    shutdown.
    
    Signed-off-by: Vadim Pasternak <[email protected]>
    

It has been verified on Mellanox devices.

Signed-off-by: Stephen Sun <[email protected]>
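
For readers who want to check the effect of such a patch on a running system, the following is a minimal sketch, not the verification procedure used for this PR (the thread does not describe it). It assumes the standard Linux thermal sysfs layout, where each zone under `/sys/class/thermal/thermal_zone*` exposes `trip_point_<N>_type` and `trip_point_<N>_temp` files. The function names `trip_points` and `zones_with_critical_trip` are hypothetical helpers introduced only for this illustration.

```python
from pathlib import Path


def trip_points(zone_dir):
    """Return sorted (index, type, temp_millicelsius) tuples for one thermal zone.

    Assumes the generic thermal sysfs file naming: trip_point_<N>_type and
    trip_point_<N>_temp. The temperature is None when it cannot be read.
    """
    zone = Path(zone_dir)
    points = []
    for type_file in zone.glob("trip_point_*_type"):
        index = int(type_file.name.split("_")[2])
        trip_type = type_file.read_text().strip()
        try:
            temp = int((zone / f"trip_point_{index}_temp").read_text().strip())
        except (OSError, ValueError):
            temp = None
        points.append((index, trip_type, temp))
    return sorted(points, key=lambda p: p[0])


def zones_with_critical_trip(root="/sys/class/thermal"):
    """Return the names of thermal zones that still expose a 'critical' trip point."""
    found = []
    for zone in sorted(Path(root).glob("thermal_zone*")):
        if any(trip_type == "critical" for _, trip_type, _ in trip_points(zone)):
            found.append(zone.name)
    return found


if __name__ == "__main__":
    offenders = zones_with_critical_trip()
    if offenders:
        print("Zones with a critical trip point:", ", ".join(offenders))
    else:
        print("No thermal zone exposes a critical trip point.")
```

The script checks every thermal zone on the system, not only the ones registered by mlxsw, so on a mixed platform the list may include zones from other drivers.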

@paulmenzel
Contributor

Thank you for the patch.

  1. Ingetrate → Integrate
  2. Please add more details, where the patch comes from. Best you give the git commit hash from the patch in Linus’ master branch.
  3. Please elaborate, what the problem is and how the patch was tested. (Most could be taken from the patch description I guess.)

@stephenxs stephenxs changed the title from "[Mellanox] Ingetrate kernel patch for hw-mgmt V.7.0010.2002" to "[Mellanox] Integrate kernel patch for hw-mgmt V.7.0010.2002" on Mar 25, 2021
@stephenxs stephenxs changed the title from "[Mellanox] Integrate kernel patch for hw-mgmt V.7.0010.2002" to "Backport patch to remove critical trip point from thermal zones" on Mar 26, 2021
@stephenxs stephenxs marked this pull request as ready for review March 30, 2021 08:15
@liat-grozovik
Collaborator

> Thank you for the patch.
>
>   1. Ingetrate → Integrate
>   2. Please add more details, where the patch comes from. Best you give the git commit hash from the patch in Linus’ master branch.
>   3. Please elaborate, what the problem is and how the patch was tested. (Most could be taken from the patch description I guess.)

This is still a work in progress as far as the Linux mainline is concerned. Our team will start the upstream process soon, but that must not delay this update: this is a real production issue and we cannot wait for upstreaming.

As for the information you are looking for, it is all in the patch itself.

"
Disable software thermal protection by removing critical trip points
for the all thermal zones.

According to the system requirements software should never perform
system thermal protection, since all the systems implement two levels
of thermal protection: the first one is performed by firmware, the
second, in case firmware was not able to perform protection, by
hardware, while the temperature threshold for hardware protection is
higher than for firmware.

In both cases, when critical temperature is reached, system will be
shutdown.
"

This is relevant only to Mellanox switches, as the main protection is done by firmware.

@stephenxs stephenxs changed the title from "Backport patch to remove critical trip point from thermal zones" to "[Mellanox] Backport patch to remove critical trip point from thermal zones" on Mar 30, 2021
@paulmenzel
Contributor

The commit in this merge/pull request still has the typos and a commit message that leaves much to be desired. It’s rude to expect reviewers, and people looking through the commits later, to read the diff to find out what was done.

If this is important to you, then I suggest you make sure that all the formal requirements for patches are met.

As for upstreaming, if it is done later, please add a note to the commit message that the patch is going to be upstreamed soon.

…zones

1. 0027-mlxsw-core-Remove-critical-trip-point-from-thermal-z.patch

    Disable software thermal protection by removing critical trip points
    for the all thermal zones.

    According to the system requirements software should never perform
    system thermal protection, since all the systems implement two levels
    of thermal protection: the first one is performed by firmware, the
    second, in case firmware was not able to perform protection, by
    hardware, while the temperature threshold for hardware protection is
    higher than for firmware.

    In both cases, when critical temperature is reached, system will be
    shutdown.

    Signed-off-by: Vadim Pasternak <[email protected]>

Signed-off-by: Stephen Sun <[email protected]>
@stephenxs
Contributor Author

> The commit in this merge/pull request still has the typos and a commit message that leaves much to be desired. It’s rude to expect reviewers, and people looking through the commits later, to read the diff to find out what was done.
>
> If this is important to you, then I suggest you make sure that all the formal requirements for patches are met.
>
> As for upstreaming, if it is done later, please add a note to the commit message that the patch is going to be upstreamed soon.

Hi @paulmenzel
Thank you for your comments. I updated the commit message as well as the PR description.

Contributor

@paulmenzel paulmenzel left a comment

“Mellanox devices” is a little broad, but maybe you have a QA system that tests the SONiC builds on all devices.

@stephenxs
Contributor Author

> “Mellanox devices” is a little broad, but maybe you have a QA system that tests the SONiC builds on all devices.

Hi paulmenzel
Thank you for your comments.
Yes, we verify all platforms supported in SONiC when integrating a new kernel patch.

@dprital
Collaborator

dprital commented Mar 31, 2021

@paulmenzel - Can you please merge this PR?

@paulmenzel
Contributor

I would if I could, but I do not have permission to merge pull requests. At least @lguohan has the permissions.

@dprital
Collaborator

dprital commented Mar 31, 2021

@lguohan - Can you please merge this PR and also add the tag that is required for 202012? Thanks.

@lguohan lguohan merged commit deddc61 into sonic-net:master Apr 1, 2021
lguohan pushed a commit that referenced this pull request Apr 1, 2021
…zones (#201)
@stephenxs stephenxs deleted the hw-mgmt.v.7.0010.2000-bf1 branch April 1, 2021 07:36