NVIDIA Nic Configuration Operator

NVIDIA Nic Configuration Operator provides a Kubernetes API (Custom Resource Definitions) that allows firmware (FW) configuration of NVIDIA NICs in a coordinated manner. It deploys a configuration daemon on each of the desired nodes to configure the NVIDIA NICs there. The operator uses the maintenance operator to prepare a node for maintenance before applying the configuration.

Deployment

Prerequisites

Helm

Deploy latest from project sources

# Clone project
git clone https://github.com/Mellanox/nic-configuration-operator.git && cd nic-configuration-operator

# Install Operator
helm install -n nic-configuration-operator --create-namespace --set operator.image.tag=latest nic-configuration ./deployment/nic-configuration-operator-chart

# View deployed resources
kubectl -n nic-configuration-operator get all

Note

Refer to the Helm values documentation for more information.

Deploy latest release from OCI repo

helm install -n nic-configuration-operator --create-namespace nic-configuration-operator oci://ghcr.io/mellanox/nic-configuration-operator-chart

CRDs

NICConfigurationTemplate

The NICConfigurationTemplate CRD is used to request FW configuration for a subset of devices.

Nic Configuration Operator will select NIC devices in the cluster that match the template's selectors and apply the configuration spec to them.

If more than one template matches a single device, none of them is applied and the error is reported in the status of each matching template.
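The selection rule above can be sketched in Python as follows. This is only an illustration, not the operator's code: the `Device` and `Template` classes, their field names, and the treatment of an empty optional selector as "match any" are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    # Hypothetical, simplified view of a NIC discovered on a node.
    node_labels: dict
    nic_type: str
    pci_addresses: list
    serial_number: str


@dataclass
class Template:
    # Hypothetical, simplified view of a NICConfigurationTemplate.
    name: str
    nic_type: str
    node_selector: dict = field(default_factory=dict)
    pci_addresses: list = field(default_factory=list)   # assumed: empty means "any"
    serial_numbers: list = field(default_factory=list)  # assumed: empty means "any"


def matches(template, device):
    """Return True if every selector of the template accepts the device."""
    for key, value in template.node_selector.items():
        if device.node_labels.get(key) != value:
            return False
    if template.nic_type != device.nic_type:
        return False
    if template.pci_addresses and not set(template.pci_addresses) & set(device.pci_addresses):
        return False
    if template.serial_numbers and device.serial_number not in template.serial_numbers:
        return False
    return True


def select_template(templates, device):
    """Return (template, error); a template is applied only if it is the sole match."""
    matching = [t for t in templates if matches(t, device)]
    if len(matching) > 1:
        names = ", ".join(sorted(t.name for t in matching))
        return None, f"device matches several templates: {names}"
    return (matching[0] if matching else None), None
```

In this sketch the error string would be written to the status of every template in `matching`, mirroring the behavior described above.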

For more information, refer to the api-reference.

Example NICConfigurationTemplate

apiVersion: configuration.net.nvidia.com/v1alpha1
kind: NICConfigurationTemplate
metadata:
   name: connectx6-config
   namespace: nic-configuration-operator
spec:
   nodeSelector:
      feature.node.kubernetes.io/network-sriov.capable: "true"
   nicSelector:
      # nicType is mandatory and only a single type can be specified; the other selectors are optional.
      nicType: 101b
      pciAddress:
         - "0000:03:00.0"
         - "0000:04:00.0"
      serialNumbers:
         - "MT2116X09299"
   resetToDefault: false # if true, the template is ignored and the device configuration is reset to defaults
   template:
      numVfs: 2
      linkType: Ethernet
      pciPerformanceOptimized:
         enabled: true
         maxAccOutRead: 44
         maxReadRequest: 5
      roceOptimized:
         enabled: true
         qos:
            trust: dscp
            pfc: "0,0,0,1,0,0,0,0"
      gpuDirectOptimized:
         enabled: true
         env: Baremetal
      rawNvConfig:
         THIS_IS_A_SPECIAL_NVCONFIG_PARAM: "55"
         SOME_ADVANCED_NVCONFIG_PARAM: "true"

Configuration details

  • numVfs: if provided, configures SR-IOV VFs via nvconfig.
    • E.g., if numVfs=2, then SRIOV_EN=1 and SRIOV_NUM_OF_VFS=2.
    • If numVfs=0, then SRIOV_EN=0 and SRIOV_NUM_OF_VFS=0.
  • linkType: if provided, configures the link type for all ports of the NIC.
    • E.g., if linkType=Infiniband, then LINK_TYPE_P1=IB is set, and LINK_TYPE_P2=IB is also set if a second PCI function is present.
  • pciPerformanceOptimized: performs PCI performance optimizations. If enabled, the following happens by default:
    • Set the MAX_ACC_OUT_READ nvconfig parameter:
      • 44 if the PCI link is Gen4.
      • 0 (use device defaults) if the PCI link is Gen5 or newer.
    • Set the PCI max read request size for each PF to 4096 (note: this is a runtime setting and is not persistent).
    • Users can override these values via maxAccOutRead and maxReadRequest.
  • roceOptimized: performs RoCE-related optimizations. If enabled, the following is performed by default:
    • Set nvconfig parameters for both ports (can be applied from PF0).
      • Applied to the second port only if it is present.
        • ROCE_CC_PRIO_MASK_P1=255, ROCE_CC_PRIO_MASK_P2=255
        • CNP_DSCP_P1=4, CNP_DSCP_P2=4
        • CNP_802P_PRIO_P1=6, CNP_802P_PRIO_P2=6
    • Configure PFC (Priority Flow Control) for priority 3 and set trust to dscp on each PF.
      • Non-persistent (needs to be reapplied after each boot).
      • Users can override these values via the trust and pfc parameters.
  • gpuDirectOptimized: performs GPUDirect optimizations. Currently, only optimizations for the Baremetal environment are supported. If enabled, the following is performed:
    • Set nvconfig ATS_ENABLED=0.
    • Can only be enabled when pciPerformanceOptimized is enabled.
  • rawNvConfig: a map[string]string that contains nvconfig parameters to apply to a NIC on all of its PFs.
    • For per-port parameters (suffixes _P1, _P2), parameters with the _P2 suffix are ignored if the device has a single port.
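As a rough illustration of the rules in this list, the Python sketch below derives the nvconfig parameters from a template given as a plain dict. It is not the operator's implementation. The function names, the `dual_port` and `pci_gen` arguments, the `ETH` value for Ethernet, leaving MAX_ACC_OUT_READ unset for links older than Gen4, and applying rawNvConfig last are all assumptions made for the example. Runtime-only settings (max read request size, PFC, trust) are not nvconfig parameters and are left out.

```python
def default_max_acc_out_read(pci_gen):
    """Default MAX_ACC_OUT_READ for a PCI link generation, per the list above."""
    if pci_gen == 4:
        return 44
    if pci_gen >= 5:
        return 0  # 0 means "use device defaults"
    return None  # assumption: older generations are not covered by the text


def desired_nvconfig(template, dual_port, pci_gen):
    """Translate a template dict into a dict of nvconfig parameters."""
    params = {}

    num_vfs = template.get("numVfs")
    if num_vfs is not None:
        params["SRIOV_EN"] = "1" if num_vfs > 0 else "0"
        params["SRIOV_NUM_OF_VFS"] = str(num_vfs)

    link_type = template.get("linkType")
    if link_type is not None:
        # "IB" comes from the text; "ETH" for Ethernet is an assumption.
        params["LINK_TYPE_P1"] = {"Infiniband": "IB", "Ethernet": "ETH"}[link_type]

    pci = template.get("pciPerformanceOptimized") or {}
    if pci.get("enabled"):
        value = pci.get("maxAccOutRead")  # a user override wins over the default
        if value is None:
            value = default_max_acc_out_read(pci_gen)
        if value is not None:
            params["MAX_ACC_OUT_READ"] = str(value)

    roce = template.get("roceOptimized") or {}
    if roce.get("enabled"):
        params["ROCE_CC_PRIO_MASK_P1"] = "255"
        params["CNP_DSCP_P1"] = "4"
        params["CNP_802P_PRIO_P1"] = "6"

    gpu = template.get("gpuDirectOptimized") or {}
    if gpu.get("enabled"):
        if not pci.get("enabled"):
            raise ValueError("gpuDirectOptimized requires pciPerformanceOptimized")
        params["ATS_ENABLED"] = "0"

    # Mirror the derived per-port parameters to the second port if it exists.
    if dual_port:
        for key in [k for k in params if k.endswith("_P1")]:
            params[key[:-1] + "2"] = params[key]

    # Raw parameters are applied last (assumed precedence); _P2 keys are
    # skipped on single-port devices.
    for key, value in (template.get("rawNvConfig") or {}).items():
        if key.endswith("_P2") and not dual_port:
            continue
        params[key] = value
    return params
```

For example, a template with `numVfs: 0` alone yields only SRIOV_EN=0 and SRIOV_NUM_OF_VFS=0, and enabling gpuDirectOptimized without pciPerformanceOptimized is rejected.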

NicDevice

The NicDevice CRD is created automatically by the configuration daemon and represents a specific NVIDIA NIC on a specific K8s node. The name of the device combines the node name, device type and its serial number for easier tracking.
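The naming scheme can be pictured with a small Python helper. The function name, the `-` separator, and the lowercasing are assumptions inferred from the example resource name in this document and from the Kubernetes requirement that object names be lowercase; the daemon's actual logic may differ.

```python
def nic_device_name(node, nic_type, serial_number):
    """Build a NicDevice-style name from node name, device type and serial number."""
    # Kubernetes object names must be lowercase, so the serial number is lowercased.
    return "-".join([node, nic_type, serial_number]).lower()
```

With the values from the example in this document, `nic_device_name("co-node-25", "101b", "MT2232T13210")` gives `co-node-25-101b-mt2232t13210`.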

The ConfigUpdateInProgress status condition can be used to track the state of the FW configuration update on a specific device. If an error occurs during the FW configuration update, it is reflected in this condition.
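A minimal Python sketch of reading that condition from a NicDevice object is shown below. It assumes the resource has already been fetched and parsed into a dict (for instance from its JSON or YAML form). The function name is hypothetical, and the `message` key is an assumption based on the usual shape of Kubernetes conditions.

```python
def config_update_state(nic_device):
    """Return (status, reason, message) of ConfigUpdateInProgress, or None if absent."""
    conditions = nic_device.get("status", {}).get("conditions", [])
    for cond in conditions:
        if cond.get("type") == "ConfigUpdateInProgress":
            # "message" is assumed to be optional, as in standard conditions.
            return cond.get("status"), cond.get("reason"), cond.get("message", "")
    return None
```

A status of "False" with the reason UpdateSuccessful, as in the example NicDevice in this document, would indicate that no update is in progress and the last one succeeded.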

For more information, refer to the api-reference.

Example NicDevice

apiVersion: configuration.net.nvidia.com/v1alpha1
kind: NicDevice
metadata:
   name: co-node-25-101b-mt2232t13210
   namespace: nic-configuration-operator
spec:
   configuration:
      template:
         linkType: Ethernet
         numVfs: 8
         pciPerformanceOptimized:
            enabled: true
         rawNvConfig:
            - name: TLS_OPTIMIZE
              value: "1"
status:
   conditions:
      - reason: UpdateSuccessful
        status: "False"
        type: ConfigUpdateInProgress
   firmwareVersion: 20.42.1000
   node: co-node-25
   partNumber: mcx632312a-hdat
   ports:
      - networkInterface: enp4s0f0np0
        pci: "0000:04:00.0"
        rdmaInterface: mlx5_0
      - networkInterface: enp4s0f1np1
        pci: "0000:04:00.1"
        rdmaInterface: mlx5_1
   psid: mt_0000000225
   serialNumber: mt2232t13210
   type: 101b

Implementation details

The NicDevice CRD is created and reconciled by the configuration daemon. The reconciliation logic scheme can be found here.