From 839e3f9bee425c90a0423d14b102a42fe6635c73 Mon Sep 17 00:00:00 2001 From: Michal Swiatkowski Date: Mon, 19 Aug 2024 12:14:01 +0200 Subject: [PATCH 01/87] ice: set correct dst VSI in only LAN filters The filters set that will reproduce the problem: $ tc filter add dev $VF0_PR ingress protocol arp prio 0 flower \ skip_sw dst_mac ff:ff:ff:ff:ff:ff action mirred egress \ redirect dev $PF0 $ tc filter add dev $VF0_PR ingress protocol arp prio 0 flower \ skip_sw dst_mac ff:ff:ff:ff:ff:ff src_mac 52:54:00:00:00:10 \ action mirred egress mirror dev $VF1_PR Expected behaviour is to set all broadcast from VF0 to the LAN. If the src_mac match the value from filters, send packet to LAN and to VF1. In this case both LAN_EN and LB_EN flags in switch is set in case of packet matching both filters. As dst VSI for the only LAN enable bit is PF VSI, the packet is being seen on PF. To fix this change dst VSI to the source VSI. It will block receiving any packet even when LB_EN is set by switch, because local loopback is clear on VF VSI during normal operation. Side note: if the second filters action is redirect instead of mirror LAN_EN is clear, because switch is AND-ing LAN_EN from each matched filters and OR-ing LB_EN. Reviewed-by: Przemek Kitszel Fixes: 73b483b79029 ("ice: Manage act flags for switchdev offloads") Signed-off-by: Michal Swiatkowski Reviewed-by: Jacob Keller Tested-by: Sujai Buvaneswaran Signed-off-by: Tony Nguyen --- drivers/net/ethernet/intel/ice/ice_tc_lib.c | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/drivers/net/ethernet/intel/ice/ice_tc_lib.c b/drivers/net/ethernet/intel/ice/ice_tc_lib.c index e6923f8121a994..ea39b999a0d000 100644 --- a/drivers/net/ethernet/intel/ice/ice_tc_lib.c +++ b/drivers/net/ethernet/intel/ice/ice_tc_lib.c @@ -819,6 +819,17 @@ ice_eswitch_add_tc_fltr(struct ice_vsi *vsi, struct ice_tc_flower_fltr *fltr) rule_info.sw_act.flag |= ICE_FLTR_TX; rule_info.sw_act.src = vsi->idx; rule_info.flags_info.act = ICE_SINGLE_ACT_LAN_ENABLE; + /* This is a specific case. The destination VSI index is + * overwritten by the source VSI index. This type of filter + * should allow the packet to go to the LAN, not to the + * VSI passed here. It should set LAN_EN bit only. However, + * the VSI must be a valid one. Setting source VSI index + * here is safe. Even if the result from switch is set LAN_EN + * and LB_EN (which normally will pass the packet to this VSI) + * packet won't be seen on the VSI, because local loopback is + * turned off. + */ + rule_info.sw_act.vsi_handle = vsi->idx; } else { /* VF to VF */ rule_info.sw_act.flag |= ICE_FLTR_TX; From ccca30a18e36a742e606d5bf0630e75be7711d0a Mon Sep 17 00:00:00 2001 From: Gui-Dong Han Date: Tue, 3 Sep 2024 11:48:43 +0000 Subject: [PATCH 02/87] ice: Fix improper handling of refcount in ice_dpll_init_rclk_pins() This patch addresses a reference count handling issue in the ice_dpll_init_rclk_pins() function. The function calls ice_dpll_get_pins(), which increments the reference count of the relevant resources. However, if the condition WARN_ON((!vsi || !vsi->netdev)) is met, the function currently returns an error without properly releasing the resources acquired by ice_dpll_get_pins(), leading to a reference count leak. To resolve this, the check has been moved to the top of the function. This ensures that the function verifies the state before any resources are acquired, avoiding the need for additional resource management in the error path. This bug was identified by an experimental static analysis tool developed by our team. The tool specializes in analyzing reference count operations and detecting potential issues where resources are not properly managed. In this case, the tool flagged the missing release operation as a potential problem, which led to the development of this patch. Fixes: d7999f5ea64b ("ice: implement dpll interface to control cgu") Cc: stable@vger.kernel.org Signed-off-by: Gui-Dong Han Reviewed-by: Simon Horman Tested-by: Pucha Himasekhar Reddy (A Contingent worker at Intel) Signed-off-by: Tony Nguyen --- drivers/net/ethernet/intel/ice/ice_dpll.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/intel/ice/ice_dpll.c b/drivers/net/ethernet/intel/ice/ice_dpll.c index cd95705d1e7fd6..8b6dc4d54fdc72 100644 --- a/drivers/net/ethernet/intel/ice/ice_dpll.c +++ b/drivers/net/ethernet/intel/ice/ice_dpll.c @@ -1843,6 +1843,8 @@ ice_dpll_init_rclk_pins(struct ice_pf *pf, struct ice_dpll_pin *pin, struct dpll_pin *parent; int ret, i; + if (WARN_ON((!vsi || !vsi->netdev))) + return -EINVAL; ret = ice_dpll_get_pins(pf, pin, start_idx, ICE_DPLL_RCLK_NUM_PER_PF, pf->dplls.clock_id); if (ret) @@ -1858,8 +1860,6 @@ ice_dpll_init_rclk_pins(struct ice_pf *pf, struct ice_dpll_pin *pin, if (ret) goto unregister_pins; } - if (WARN_ON((!vsi || !vsi->netdev))) - return -EINVAL; dpll_netdev_pin_set(vsi->netdev, pf->dplls.rclk.pin); return 0; From d517cf89874c6039e6294b18d66f40988e62502a Mon Sep 17 00:00:00 2001 From: Gui-Dong Han Date: Tue, 3 Sep 2024 11:59:43 +0000 Subject: [PATCH 03/87] ice: Fix improper handling of refcount in ice_sriov_set_msix_vec_count() This patch addresses an issue with improper reference count handling in the ice_sriov_set_msix_vec_count() function. First, the function calls ice_get_vf_by_id(), which increments the reference count of the vf pointer. If the subsequent call to ice_get_vf_vsi() fails, the function currently returns an error without decrementing the reference count of the vf pointer, leading to a reference count leak. The correct behavior, as implemented in this patch, is to decrement the reference count using ice_put_vf(vf) before returning an error when vsi is NULL. Second, the function calls ice_sriov_get_irqs(), which sets vf->first_vector_idx. If this call returns a negative value, indicating an error, the function returns an error without decrementing the reference count of the vf pointer, resulting in another reference count leak. The patch addresses this by adding a call to ice_put_vf(vf) before returning an error when vf->first_vector_idx < 0. This bug was identified by an experimental static analysis tool developed by our team. The tool specializes in analyzing reference count operations and identifying potential mismanagement of reference counts. In this case, the tool flagged the missing decrement operation as a potential issue, leading to this patch. Fixes: 4035c72dc1ba ("ice: reconfig host after changing MSI-X on VF") Fixes: 4d38cb44bd32 ("ice: manage VFs MSI-X using resource tracking") Cc: stable@vger.kernel.org Signed-off-by: Gui-Dong Han Reviewed-by: Simon Horman Tested-by: Rafal Romanowski Signed-off-by: Tony Nguyen --- drivers/net/ethernet/intel/ice/ice_sriov.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/intel/ice/ice_sriov.c b/drivers/net/ethernet/intel/ice/ice_sriov.c index e34fe2516cccf2..c2d6b2a144e940 100644 --- a/drivers/net/ethernet/intel/ice/ice_sriov.c +++ b/drivers/net/ethernet/intel/ice/ice_sriov.c @@ -1096,8 +1096,10 @@ int ice_sriov_set_msix_vec_count(struct pci_dev *vf_dev, int msix_vec_count) return -ENOENT; vsi = ice_get_vf_vsi(vf); - if (!vsi) + if (!vsi) { + ice_put_vf(vf); return -ENOENT; + } prev_msix = vf->num_msix; prev_queues = vf->num_vf_qs; @@ -1142,8 +1144,10 @@ int ice_sriov_set_msix_vec_count(struct pci_dev *vf_dev, int msix_vec_count) vf->num_msix = prev_msix; vf->num_vf_qs = prev_queues; vf->first_vector_idx = ice_sriov_get_irqs(pf, vf->num_msix); - if (vf->first_vector_idx < 0) + if (vf->first_vector_idx < 0) { + ice_put_vf(vf); return -EINVAL; + } if (needs_rebuild) { ice_vf_reconfig_vsi(vf); From d019b1a9128d65956f04679ec2bb8b0800f13358 Mon Sep 17 00:00:00 2001 From: Michal Swiatkowski Date: Fri, 6 Sep 2024 14:57:06 +0200 Subject: [PATCH 04/87] ice: clear port vlan config during reset Since commit 2a2cb4c6c181 ("ice: replace ice_vf_recreate_vsi() with ice_vf_reconfig_vsi()") VF VSI is only reconfigured instead of recreated. The context configuration from previous setting is still the same. If any of the config needs to be cleared it needs to be cleared explicitly. Previously there was assumption that port vlan will be cleared automatically. Now, when VSI is only reconfigured we have to do it in the code. Not clearing port vlan configuration leads to situation when the driver VSI config is different than the VSI config in HW. Traffic can't be passed after setting and clearing port vlan, because of invalid VSI config in HW. Example reproduction: > ip a a dev $(VF) $(VF_IP_ADDRESS) > ip l s dev $(VF) up > ping $(VF_IP_ADDRESS) ping is working fine here > ip link set eth5 vf 0 vlan 100 > ip link set eth5 vf 0 vlan 0 > ping $(VF_IP_ADDRESS) ping isn't working Fixes: 2a2cb4c6c181 ("ice: replace ice_vf_recreate_vsi() with ice_vf_reconfig_vsi()") Signed-off-by: Michal Swiatkowski Reviewed-by: Wojciech Drewek Tested-by: Piotr Tyda Signed-off-by: Tony Nguyen --- drivers/net/ethernet/intel/ice/ice_vf_lib.c | 7 +++ .../net/ethernet/intel/ice/ice_vsi_vlan_lib.c | 57 +++++++++++++++++++ .../net/ethernet/intel/ice/ice_vsi_vlan_lib.h | 1 + 3 files changed, 65 insertions(+) diff --git a/drivers/net/ethernet/intel/ice/ice_vf_lib.c b/drivers/net/ethernet/intel/ice/ice_vf_lib.c index a69e91f88d8118..749a08ccf2678b 100644 --- a/drivers/net/ethernet/intel/ice/ice_vf_lib.c +++ b/drivers/net/ethernet/intel/ice/ice_vf_lib.c @@ -335,6 +335,13 @@ static int ice_vf_rebuild_host_vlan_cfg(struct ice_vf *vf, struct ice_vsi *vsi) err = vlan_ops->add_vlan(vsi, &vf->port_vlan_info); } else { + /* clear possible previous port vlan config */ + err = ice_vsi_clear_port_vlan(vsi); + if (err) { + dev_err(dev, "failed to clear port VLAN via VSI parameters for VF %u, error %d\n", + vf->vf_id, err); + return err; + } err = ice_vsi_add_vlan_zero(vsi); } diff --git a/drivers/net/ethernet/intel/ice/ice_vsi_vlan_lib.c b/drivers/net/ethernet/intel/ice/ice_vsi_vlan_lib.c index 6e8f2aab608015..5291f2888ef89a 100644 --- a/drivers/net/ethernet/intel/ice/ice_vsi_vlan_lib.c +++ b/drivers/net/ethernet/intel/ice/ice_vsi_vlan_lib.c @@ -787,3 +787,60 @@ int ice_vsi_clear_outer_port_vlan(struct ice_vsi *vsi) kfree(ctxt); return err; } + +int ice_vsi_clear_port_vlan(struct ice_vsi *vsi) +{ + struct ice_hw *hw = &vsi->back->hw; + struct ice_vsi_ctx *ctxt; + int err; + + ctxt = kzalloc(sizeof(*ctxt), GFP_KERNEL); + if (!ctxt) + return -ENOMEM; + + ctxt->info = vsi->info; + + ctxt->info.port_based_outer_vlan = 0; + ctxt->info.port_based_inner_vlan = 0; + + ctxt->info.inner_vlan_flags = + FIELD_PREP(ICE_AQ_VSI_INNER_VLAN_TX_MODE_M, + ICE_AQ_VSI_INNER_VLAN_TX_MODE_ALL); + if (ice_is_dvm_ena(hw)) { + ctxt->info.inner_vlan_flags |= + FIELD_PREP(ICE_AQ_VSI_INNER_VLAN_EMODE_M, + ICE_AQ_VSI_INNER_VLAN_EMODE_NOTHING); + ctxt->info.outer_vlan_flags = + FIELD_PREP(ICE_AQ_VSI_OUTER_VLAN_TX_MODE_M, + ICE_AQ_VSI_OUTER_VLAN_TX_MODE_ALL); + ctxt->info.outer_vlan_flags |= + FIELD_PREP(ICE_AQ_VSI_OUTER_TAG_TYPE_M, + ICE_AQ_VSI_OUTER_TAG_VLAN_8100); + ctxt->info.outer_vlan_flags |= + ICE_AQ_VSI_OUTER_VLAN_EMODE_NOTHING << + ICE_AQ_VSI_OUTER_VLAN_EMODE_S; + } + + ctxt->info.sw_flags2 &= ~ICE_AQ_VSI_SW_FLAG_RX_VLAN_PRUNE_ENA; + ctxt->info.valid_sections = + cpu_to_le16(ICE_AQ_VSI_PROP_OUTER_TAG_VALID | + ICE_AQ_VSI_PROP_VLAN_VALID | + ICE_AQ_VSI_PROP_SW_VALID); + + err = ice_update_vsi(hw, vsi->idx, ctxt, NULL); + if (err) { + dev_err(ice_pf_to_dev(vsi->back), "update VSI for clearing port based VLAN failed, err %d aq_err %s\n", + err, ice_aq_str(hw->adminq.sq_last_status)); + } else { + vsi->info.port_based_outer_vlan = + ctxt->info.port_based_outer_vlan; + vsi->info.port_based_inner_vlan = + ctxt->info.port_based_inner_vlan; + vsi->info.outer_vlan_flags = ctxt->info.outer_vlan_flags; + vsi->info.inner_vlan_flags = ctxt->info.inner_vlan_flags; + vsi->info.sw_flags2 = ctxt->info.sw_flags2; + } + + kfree(ctxt); + return err; +} diff --git a/drivers/net/ethernet/intel/ice/ice_vsi_vlan_lib.h b/drivers/net/ethernet/intel/ice/ice_vsi_vlan_lib.h index f0d84d11bd5b1f..12b227621a7ddc 100644 --- a/drivers/net/ethernet/intel/ice/ice_vsi_vlan_lib.h +++ b/drivers/net/ethernet/intel/ice/ice_vsi_vlan_lib.h @@ -36,5 +36,6 @@ int ice_vsi_ena_outer_insertion(struct ice_vsi *vsi, u16 tpid); int ice_vsi_dis_outer_insertion(struct ice_vsi *vsi); int ice_vsi_set_outer_port_vlan(struct ice_vsi *vsi, struct ice_vlan *vlan); int ice_vsi_clear_outer_port_vlan(struct ice_vsi *vsi); +int ice_vsi_clear_port_vlan(struct ice_vsi *vsi); #endif /* _ICE_VSI_VLAN_LIB_H_ */ From c188afdc36113760873ec78cbc036f6b05f77621 Mon Sep 17 00:00:00 2001 From: Przemek Kitszel Date: Tue, 10 Sep 2024 15:57:21 +0200 Subject: [PATCH 05/87] ice: fix memleak in ice_init_tx_topology() Fix leak of the FW blob (DDP pkg). Make ice_cfg_tx_topo() const-correct, so ice_init_tx_topology() can avoid copying whole FW blob. Copy just the topology section, and only when needed. Reuse the buffer allocated for the read of the current topology. This was found by kmemleak, with the following trace for each PF: [] kmemdup_noprof+0x1d/0x50 [] ice_init_ddp_config+0x100/0x220 [ice] [] ice_init_dev+0x6f/0x200 [ice] [] ice_init+0x29/0x560 [ice] [] ice_probe+0x21d/0x310 [ice] Constify ice_cfg_tx_topo() @buf parameter. This cascades further down to few more functions. Fixes: cc5776fe1832 ("ice: Enable switching default Tx scheduler topology") CC: Larysa Zaremba CC: Jacob Keller CC: Pucha Himasekhar Reddy CC: Mateusz Polchlopek Signed-off-by: Przemek Kitszel Reviewed-by: Jacob Keller Tested-by: Pucha Himasekhar Reddy (A Contingent worker at Intel) Signed-off-by: Tony Nguyen --- drivers/net/ethernet/intel/ice/ice_ddp.c | 58 +++++++++++------------ drivers/net/ethernet/intel/ice/ice_ddp.h | 4 +- drivers/net/ethernet/intel/ice/ice_main.c | 8 +--- 3 files changed, 31 insertions(+), 39 deletions(-) diff --git a/drivers/net/ethernet/intel/ice/ice_ddp.c b/drivers/net/ethernet/intel/ice/ice_ddp.c index 953262b88a586f..272fd823a825d0 100644 --- a/drivers/net/ethernet/intel/ice/ice_ddp.c +++ b/drivers/net/ethernet/intel/ice/ice_ddp.c @@ -31,7 +31,7 @@ static const struct ice_tunnel_type_scan tnls[] = { * Verifies various attributes of the package file, including length, format * version, and the requirement of at least one segment. */ -static enum ice_ddp_state ice_verify_pkg(struct ice_pkg_hdr *pkg, u32 len) +static enum ice_ddp_state ice_verify_pkg(const struct ice_pkg_hdr *pkg, u32 len) { u32 seg_count; u32 i; @@ -57,13 +57,13 @@ static enum ice_ddp_state ice_verify_pkg(struct ice_pkg_hdr *pkg, u32 len) /* all segments must fit within length */ for (i = 0; i < seg_count; i++) { u32 off = le32_to_cpu(pkg->seg_offset[i]); - struct ice_generic_seg_hdr *seg; + const struct ice_generic_seg_hdr *seg; /* segment header must fit */ if (len < off + sizeof(*seg)) return ICE_DDP_PKG_INVALID_FILE; - seg = (struct ice_generic_seg_hdr *)((u8 *)pkg + off); + seg = (void *)pkg + off; /* segment body must fit */ if (len < off + le32_to_cpu(seg->seg_size)) @@ -119,13 +119,13 @@ static enum ice_ddp_state ice_chk_pkg_version(struct ice_pkg_ver *pkg_ver) * * This helper function validates a buffer's header. */ -static struct ice_buf_hdr *ice_pkg_val_buf(struct ice_buf *buf) +static const struct ice_buf_hdr *ice_pkg_val_buf(const struct ice_buf *buf) { - struct ice_buf_hdr *hdr; + const struct ice_buf_hdr *hdr; u16 section_count; u16 data_end; - hdr = (struct ice_buf_hdr *)buf->buf; + hdr = (const struct ice_buf_hdr *)buf->buf; /* verify data */ section_count = le16_to_cpu(hdr->section_count); if (section_count < ICE_MIN_S_COUNT || section_count > ICE_MAX_S_COUNT) @@ -165,8 +165,8 @@ static struct ice_buf_table *ice_find_buf_table(struct ice_seg *ice_seg) * unexpected value has been detected (for example an invalid section count or * an invalid buffer end value). */ -static struct ice_buf_hdr *ice_pkg_enum_buf(struct ice_seg *ice_seg, - struct ice_pkg_enum *state) +static const struct ice_buf_hdr *ice_pkg_enum_buf(struct ice_seg *ice_seg, + struct ice_pkg_enum *state) { if (ice_seg) { state->buf_table = ice_find_buf_table(ice_seg); @@ -1800,9 +1800,9 @@ int ice_update_pkg(struct ice_hw *hw, struct ice_buf *bufs, u32 count) * success it returns a pointer to the segment header, otherwise it will * return NULL. */ -static struct ice_generic_seg_hdr * +static const struct ice_generic_seg_hdr * ice_find_seg_in_pkg(struct ice_hw *hw, u32 seg_type, - struct ice_pkg_hdr *pkg_hdr) + const struct ice_pkg_hdr *pkg_hdr) { u32 i; @@ -1813,11 +1813,9 @@ ice_find_seg_in_pkg(struct ice_hw *hw, u32 seg_type, /* Search all package segments for the requested segment type */ for (i = 0; i < le32_to_cpu(pkg_hdr->seg_count); i++) { - struct ice_generic_seg_hdr *seg; + const struct ice_generic_seg_hdr *seg; - seg = (struct ice_generic_seg_hdr - *)((u8 *)pkg_hdr + - le32_to_cpu(pkg_hdr->seg_offset[i])); + seg = (void *)pkg_hdr + le32_to_cpu(pkg_hdr->seg_offset[i]); if (le32_to_cpu(seg->seg_type) == seg_type) return seg; @@ -2354,12 +2352,12 @@ ice_get_set_tx_topo(struct ice_hw *hw, u8 *buf, u16 buf_size, * * Return: zero when update was successful, negative values otherwise. */ -int ice_cfg_tx_topo(struct ice_hw *hw, u8 *buf, u32 len) +int ice_cfg_tx_topo(struct ice_hw *hw, const void *buf, u32 len) { - u8 *current_topo, *new_topo = NULL; - struct ice_run_time_cfg_seg *seg; - struct ice_buf_hdr *section; - struct ice_pkg_hdr *pkg_hdr; + u8 *new_topo = NULL, *topo __free(kfree) = NULL; + const struct ice_run_time_cfg_seg *seg; + const struct ice_buf_hdr *section; + const struct ice_pkg_hdr *pkg_hdr; enum ice_ddp_state state; u16 offset, size = 0; u32 reg = 0; @@ -2375,15 +2373,13 @@ int ice_cfg_tx_topo(struct ice_hw *hw, u8 *buf, u32 len) return -EOPNOTSUPP; } - current_topo = kzalloc(ICE_AQ_MAX_BUF_LEN, GFP_KERNEL); - if (!current_topo) + topo = kzalloc(ICE_AQ_MAX_BUF_LEN, GFP_KERNEL); + if (!topo) return -ENOMEM; - /* Get the current Tx topology */ - status = ice_get_set_tx_topo(hw, current_topo, ICE_AQ_MAX_BUF_LEN, NULL, - &flags, false); - - kfree(current_topo); + /* Get the current Tx topology flags */ + status = ice_get_set_tx_topo(hw, topo, ICE_AQ_MAX_BUF_LEN, NULL, &flags, + false); if (status) { ice_debug(hw, ICE_DBG_INIT, "Get current topology is failed\n"); @@ -2419,7 +2415,7 @@ int ice_cfg_tx_topo(struct ice_hw *hw, u8 *buf, u32 len) goto update_topo; } - pkg_hdr = (struct ice_pkg_hdr *)buf; + pkg_hdr = (const struct ice_pkg_hdr *)buf; state = ice_verify_pkg(pkg_hdr, len); if (state) { ice_debug(hw, ICE_DBG_INIT, "Failed to verify pkg (err: %d)\n", @@ -2428,7 +2424,7 @@ int ice_cfg_tx_topo(struct ice_hw *hw, u8 *buf, u32 len) } /* Find runtime configuration segment */ - seg = (struct ice_run_time_cfg_seg *) + seg = (const struct ice_run_time_cfg_seg *) ice_find_seg_in_pkg(hw, SEGMENT_TYPE_ICE_RUN_TIME_CFG, pkg_hdr); if (!seg) { ice_debug(hw, ICE_DBG_INIT, "5 layer topology segment is missing\n"); @@ -2461,8 +2457,10 @@ int ice_cfg_tx_topo(struct ice_hw *hw, u8 *buf, u32 len) return -EIO; } - /* Get the new topology buffer */ - new_topo = ((u8 *)section) + offset; + /* Get the new topology buffer, reuse current topo copy mem */ + static_assert(ICE_PKG_BUF_SIZE == ICE_AQ_MAX_BUF_LEN); + new_topo = topo; + memcpy(new_topo, (u8 *)section + offset, size); update_topo: /* Acquire global lock to make sure that set topology issued diff --git a/drivers/net/ethernet/intel/ice/ice_ddp.h b/drivers/net/ethernet/intel/ice/ice_ddp.h index 97f27231747548..79551da2a4b022 100644 --- a/drivers/net/ethernet/intel/ice/ice_ddp.h +++ b/drivers/net/ethernet/intel/ice/ice_ddp.h @@ -438,7 +438,7 @@ struct ice_pkg_enum { u32 buf_idx; u32 type; - struct ice_buf_hdr *buf; + const struct ice_buf_hdr *buf; u32 sect_idx; void *sect; u32 sect_type; @@ -467,6 +467,6 @@ ice_pkg_enum_entry(struct ice_seg *ice_seg, struct ice_pkg_enum *state, void *ice_pkg_enum_section(struct ice_seg *ice_seg, struct ice_pkg_enum *state, u32 sect_type); -int ice_cfg_tx_topo(struct ice_hw *hw, u8 *buf, u32 len); +int ice_cfg_tx_topo(struct ice_hw *hw, const void *buf, u32 len); #endif diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c index eeb48cc48e089f..fbab72fab79ca7 100644 --- a/drivers/net/ethernet/intel/ice/ice_main.c +++ b/drivers/net/ethernet/intel/ice/ice_main.c @@ -4536,16 +4536,10 @@ ice_init_tx_topology(struct ice_hw *hw, const struct firmware *firmware) u8 num_tx_sched_layers = hw->num_tx_sched_layers; struct ice_pf *pf = hw->back; struct device *dev; - u8 *buf_copy; int err; dev = ice_pf_to_dev(pf); - /* ice_cfg_tx_topo buf argument is not a constant, - * so we have to make a copy - */ - buf_copy = kmemdup(firmware->data, firmware->size, GFP_KERNEL); - - err = ice_cfg_tx_topo(hw, buf_copy, firmware->size); + err = ice_cfg_tx_topo(hw, firmware->data, firmware->size); if (!err) { if (hw->num_tx_sched_layers > num_tx_sched_layers) dev_info(dev, "Tx scheduling layers switching feature disabled\n"); From afe6e30e7701979f536f8fbf6fdef7212441f61a Mon Sep 17 00:00:00 2001 From: Arkadiusz Kubalewski Date: Thu, 12 Sep 2024 10:54:28 +0200 Subject: [PATCH 06/87] ice: disallow DPLL_PIN_STATE_SELECTABLE for dpll output pins Currently the user may request DPLL_PIN_STATE_SELECTABLE for an output pin, and this would actually set the DISCONNECTED state instead. It doesn't make any sense. SELECTABLE is valid only in case of input pins (on AUTOMATIC type dpll), where dpll itself would select best valid input. For the output pin only CONNECTED/DISCONNECTED are expected. Fixes: d7999f5ea64b ("ice: implement dpll interface to control cgu") Reviewed-by: Aleksandr Loktionov Reviewed-by: Paul Menzel Signed-off-by: Arkadiusz Kubalewski Tested-by: Pucha Himasekhar Reddy (A Contingent worker at Intel) Signed-off-by: Tony Nguyen --- drivers/net/ethernet/intel/ice/ice_dpll.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/net/ethernet/intel/ice/ice_dpll.c b/drivers/net/ethernet/intel/ice/ice_dpll.c index 8b6dc4d54fdc72..74c0e7319a4cac 100644 --- a/drivers/net/ethernet/intel/ice/ice_dpll.c +++ b/drivers/net/ethernet/intel/ice/ice_dpll.c @@ -656,6 +656,8 @@ ice_dpll_output_state_set(const struct dpll_pin *pin, void *pin_priv, struct ice_dpll_pin *p = pin_priv; struct ice_dpll *d = dpll_priv; + if (state == DPLL_PIN_STATE_SELECTABLE) + return -EINVAL; if (!enable && p->state[d->dpll_idx] == DPLL_PIN_STATE_DISCONNECTED) return 0; From 0eae2c136cb624e4050092feb59f18159b4f2512 Mon Sep 17 00:00:00 2001 From: Dave Ertman Date: Wed, 18 Sep 2024 14:02:56 -0400 Subject: [PATCH 07/87] ice: fix VLAN replay after reset There is a bug currently when there are more than one VLAN defined and any reset that affects the PF is initiated, after the reset rebuild no traffic will pass on any VLAN but the last one created. This is caused by the iteration though the VLANs during replay each clearing the vsi_map bitmap of the VSI that is being replayed. The problem is that during rhe replay, the pointer to the vsi_map bitmap is used by each successive vlan to determine if it should be replayed on this VSI. The logic was that the replay of the VLAN would replace the bit in the map before the next VLAN would iterate through. But, since the replay copies the old bitmap pointer to filt_replay_rules and creates a new one for the recreated VLANS, it does not do this, and leaves the old bitmap broken to be used to replay the remaining VLANs. Since the old bitmap will be cleaned up in post replay cleanup, there is no need to alter it and break following VLAN replay, so don't clear the bit. Fixes: 334cb0626de1 ("ice: Implement VSI replay framework") Reviewed-by: Przemek Kitszel Signed-off-by: Dave Ertman Reviewed-by: Jacob Keller Tested-by: Pucha Himasekhar Reddy (A Contingent worker at Intel) Signed-off-by: Tony Nguyen --- drivers/net/ethernet/intel/ice/ice_switch.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/drivers/net/ethernet/intel/ice/ice_switch.c b/drivers/net/ethernet/intel/ice/ice_switch.c index 79d91e95358ca1..0e740342e2947e 100644 --- a/drivers/net/ethernet/intel/ice/ice_switch.c +++ b/drivers/net/ethernet/intel/ice/ice_switch.c @@ -6322,8 +6322,6 @@ ice_replay_vsi_fltr(struct ice_hw *hw, u16 vsi_handle, u8 recp_id, if (!itr->vsi_list_info || !test_bit(vsi_handle, itr->vsi_list_info->vsi_map)) continue; - /* Clearing it so that the logic can add it back */ - clear_bit(vsi_handle, itr->vsi_list_info->vsi_map); f_entry.fltr_info.vsi_handle = vsi_handle; f_entry.fltr_info.fltr_act = ICE_FWD_TO_VSI; /* update the src in case it is VSI num */ From d382c7bc236d4fc7b087ddad2732c84d222a4dc9 Mon Sep 17 00:00:00 2001 From: Ahmed Zaki Date: Wed, 28 Aug 2024 16:38:25 -0600 Subject: [PATCH 08/87] idpf: fix VF dynamic interrupt ctl register initialization The VF's dynamic interrupt ctl "dyn_ctl_intrvl_s" is not initialized in idpf_vf_intr_reg_init(). This resulted in the following UBSAN error whenever a VF is created: [ 564.345655] UBSAN: shift-out-of-bounds in drivers/net/ethernet/intel/idpf/idpf_txrx.c:3654:10 [ 564.345663] shift exponent 4294967295 is too large for 32-bit type 'int' [ 564.345671] CPU: 33 UID: 0 PID: 2458 Comm: NetworkManager Not tainted 6.11.0-rc4+ #1 [ 564.345678] Hardware name: Intel Corporation M50CYP2SBSTD/M50CYP2SBSTD, BIOS SE5C6200.86B.0027.P10.2201070222 01/07/2022 [ 564.345683] Call Trace: [ 564.345688] [ 564.345693] dump_stack_lvl+0x91/0xb0 [ 564.345708] __ubsan_handle_shift_out_of_bounds+0x16b/0x320 [ 564.345730] idpf_vport_intr_update_itr_ena_irq.cold+0x13/0x39 [idpf] [ 564.345755] ? __pfx_idpf_vport_intr_update_itr_ena_irq+0x10/0x10 [idpf] [ 564.345771] ? static_obj+0x95/0xd0 [ 564.345782] ? lockdep_init_map_type+0x1a5/0x800 [ 564.345794] idpf_vport_intr_ena+0x5ef/0x9f0 [idpf] [ 564.345814] idpf_vport_open+0x2cc/0x1240 [idpf] [ 564.345837] idpf_open+0x6d/0xc0 [idpf] [ 564.345850] __dev_open+0x241/0x420 Fixes: d4d558718266 ("idpf: initialize interrupts and enable vport") Reviewed-by: Przemek Kitszel Signed-off-by: Ahmed Zaki Reviewed-by: Simon Horman Tested-by: Krishneil Singh Signed-off-by: Tony Nguyen --- drivers/net/ethernet/intel/idpf/idpf_vf_dev.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/ethernet/intel/idpf/idpf_vf_dev.c b/drivers/net/ethernet/intel/idpf/idpf_vf_dev.c index 99b8dbaf4225c5..aad62e270ae40e 100644 --- a/drivers/net/ethernet/intel/idpf/idpf_vf_dev.c +++ b/drivers/net/ethernet/intel/idpf/idpf_vf_dev.c @@ -99,6 +99,7 @@ static int idpf_vf_intr_reg_init(struct idpf_vport *vport) intr->dyn_ctl_intena_m = VF_INT_DYN_CTLN_INTENA_M; intr->dyn_ctl_intena_msk_m = VF_INT_DYN_CTLN_INTENA_MSK_M; intr->dyn_ctl_itridx_s = VF_INT_DYN_CTLN_ITR_INDX_S; + intr->dyn_ctl_intrvl_s = VF_INT_DYN_CTLN_INTERVAL_S; intr->dyn_ctl_wb_on_itr_m = VF_INT_DYN_CTLN_WB_ON_ITR_M; spacing = IDPF_ITR_IDX_SPACING(reg_vals[vec_id].itrn_index_spacing, From 640f70063e6d3a76a63f57e130fba43ba8c7e980 Mon Sep 17 00:00:00 2001 From: Joshua Hay Date: Tue, 3 Sep 2024 11:49:56 -0700 Subject: [PATCH 09/87] idpf: use actual mbx receive payload length When a mailbox message is received, the driver is checking for a non 0 datalen in the controlq descriptor. If it is valid, the payload is attached to the ctlq message to give to the upper layer. However, the payload response size given to the upper layer was taken from the buffer metadata which is _always_ the max buffer size. This meant the API was returning 4K as the payload size for all messages. This went unnoticed since the virtchnl exchange response logic was checking for a response size less than 0 (error), not less than exact size, or not greater than or equal to the max mailbox buffer size (4K). All of these checks will pass in the success case since the size provided is always 4K. However, this breaks anyone that wants to validate the exact response size. Fetch the actual payload length from the value provided in the descriptor data_len field (instead of the buffer metadata). Unfortunately, this means we lose some extra error parsing for variable sized virtchnl responses such as create vport and get ptypes. However, the original checks weren't really helping anyways since the size was _always_ 4K. Fixes: 34c21fa894a1 ("idpf: implement virtchnl transaction manager") Cc: stable@vger.kernel.org # 6.9+ Signed-off-by: Joshua Hay Reviewed-by: Przemek Kitszel Tested-by: Krishneil Singh Signed-off-by: Tony Nguyen --- drivers/net/ethernet/intel/idpf/idpf_virtchnl.c | 9 +-------- 1 file changed, 1 insertion(+), 8 deletions(-) diff --git a/drivers/net/ethernet/intel/idpf/idpf_virtchnl.c b/drivers/net/ethernet/intel/idpf/idpf_virtchnl.c index 70986e12da28e3..3c0f97650d72fd 100644 --- a/drivers/net/ethernet/intel/idpf/idpf_virtchnl.c +++ b/drivers/net/ethernet/intel/idpf/idpf_virtchnl.c @@ -666,7 +666,7 @@ idpf_vc_xn_forward_reply(struct idpf_adapter *adapter, if (ctlq_msg->data_len) { payload = ctlq_msg->ctx.indirect.payload->va; - payload_size = ctlq_msg->ctx.indirect.payload->size; + payload_size = ctlq_msg->data_len; } xn->reply_sz = payload_size; @@ -1295,10 +1295,6 @@ int idpf_send_create_vport_msg(struct idpf_adapter *adapter, err = reply_sz; goto free_vport_params; } - if (reply_sz < IDPF_CTLQ_MAX_BUF_LEN) { - err = -EIO; - goto free_vport_params; - } return 0; @@ -2602,9 +2598,6 @@ int idpf_send_get_rx_ptype_msg(struct idpf_vport *vport) if (reply_sz < 0) return reply_sz; - if (reply_sz < IDPF_CTLQ_MAX_BUF_LEN) - return -EIO; - ptypes_recvd += le16_to_cpu(ptype_info->num_ptypes); if (ptypes_recvd > max_ptype) return -EINVAL; From 09d0fb5cb30ebcaed4a33028ae383f5a1463e2b2 Mon Sep 17 00:00:00 2001 From: Larysa Zaremba Date: Wed, 4 Sep 2024 11:54:17 +0200 Subject: [PATCH 10/87] idpf: deinit virtchnl transaction manager after vport and vectors When the device is removed, idpf is supposed to make certain virtchnl requests e.g. VIRTCHNL2_OP_DEALLOC_VECTORS and VIRTCHNL2_OP_DESTROY_VPORT. However, this does not happen due to the referenced commit introducing virtchnl transaction manager and placing its deinitialization before those messages are sent. Then the sending is impossible due to no transactions being available. Lack of cleanup can lead to the FW becoming unresponsive from e.g. unloading-loading the driver and creating-destroying VFs afterwards. Move transaction manager deinitialization to after other virtchnl-related cleanup is done. Fixes: 34c21fa894a1 ("idpf: implement virtchnl transaction manager") Reviewed-by: Przemek Kitszel Signed-off-by: Larysa Zaremba Tested-by: Krishneil Singh Signed-off-by: Tony Nguyen --- drivers/net/ethernet/intel/idpf/idpf_virtchnl.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/intel/idpf/idpf_virtchnl.c b/drivers/net/ethernet/intel/idpf/idpf_virtchnl.c index 3c0f97650d72fd..15c00a01f1c0b9 100644 --- a/drivers/net/ethernet/intel/idpf/idpf_virtchnl.c +++ b/drivers/net/ethernet/intel/idpf/idpf_virtchnl.c @@ -3081,9 +3081,9 @@ void idpf_vc_core_deinit(struct idpf_adapter *adapter) if (!test_bit(IDPF_VC_CORE_INIT, adapter->flags)) return; - idpf_vc_xn_shutdown(adapter->vcxn_mngr); idpf_deinit_task(adapter); idpf_intr_rel(adapter); + idpf_vc_xn_shutdown(adapter->vcxn_mngr); cancel_delayed_work_sync(&adapter->serv_task); cancel_delayed_work_sync(&adapter->mbx_task); From a842e443ca8184f2dc82ab307b43a8b38defd6a5 Mon Sep 17 00:00:00 2001 From: Ingo van Lil Date: Wed, 2 Oct 2024 18:18:07 +0200 Subject: [PATCH 11/87] net: phy: dp83869: fix memory corruption when enabling fiber When configuring the fiber port, the DP83869 PHY driver incorrectly calls linkmode_set_bit() with a bit mask (1 << 10) rather than a bit number (10). This corrupts some other memory location -- in case of arm64 the priv pointer in the same structure. Since the advertising flags are updated from supported at the end of the function the incorrect line isn't needed at all and can be removed. Fixes: a29de52ba2a1 ("net: dp83869: Add ability to advertise Fiber connection") Signed-off-by: Ingo van Lil Reviewed-by: Alexander Sverdlin Reviewed-by: Andrew Lunn Link: https://patch.msgid.link/20241002161807.440378-1-inguin@gmx.de Signed-off-by: Jakub Kicinski --- drivers/net/phy/dp83869.c | 1 - 1 file changed, 1 deletion(-) diff --git a/drivers/net/phy/dp83869.c b/drivers/net/phy/dp83869.c index d7aaefb5226b62..5f056d7db83eed 100644 --- a/drivers/net/phy/dp83869.c +++ b/drivers/net/phy/dp83869.c @@ -645,7 +645,6 @@ static int dp83869_configure_fiber(struct phy_device *phydev, phydev->supported); linkmode_set_bit(ETHTOOL_LINK_MODE_FIBRE_BIT, phydev->supported); - linkmode_set_bit(ADVERTISED_FIBRE, phydev->advertising); if (dp83869->mode == DP83869_RGMII_1000_BASE) { linkmode_set_bit(ETHTOOL_LINK_MODE_1000baseX_Full_BIT, From 55e802468e1d38dec8e25a2fdb6078d45b647e8c Mon Sep 17 00:00:00 2001 From: Sebastian Andrzej Siewior Date: Wed, 2 Oct 2024 14:58:37 +0200 Subject: [PATCH 12/87] sfc: Don't invoke xdp_do_flush() from netpoll. Yury reported a crash in the sfc driver originated from netpoll_send_udp(). The netconsole sends a message and then netpoll invokes the driver's NAPI function with a budget of zero. It is dedicated to allow driver to free TX resources, that it may have used while sending the packet. In the netpoll case the driver invokes xdp_do_flush() unconditionally, leading to crash because bpf_net_context was never assigned. Invoke xdp_do_flush() only if budget is not zero. Fixes: 401cb7dae8130 ("net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.") Reported-by: Yury Vostrikov Closes: https://lore.kernel.org/5627f6d1-5491-4462-9d75-bc0612c26a22@app.fastmail.com Signed-off-by: Sebastian Andrzej Siewior Reviewed-by: Edward Cree Link: https://patch.msgid.link/20241002125837.utOcRo6Y@linutronix.de Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/sfc/efx_channels.c | 3 ++- drivers/net/ethernet/sfc/siena/efx_channels.c | 3 ++- 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/sfc/efx_channels.c b/drivers/net/ethernet/sfc/efx_channels.c index c9e17a8208a901..f1723a6fb082b4 100644 --- a/drivers/net/ethernet/sfc/efx_channels.c +++ b/drivers/net/ethernet/sfc/efx_channels.c @@ -1260,7 +1260,8 @@ static int efx_poll(struct napi_struct *napi, int budget) spent = efx_process_channel(channel, budget); - xdp_do_flush(); + if (budget) + xdp_do_flush(); if (spent < budget) { if (efx_channel_has_rx_queue(channel) && diff --git a/drivers/net/ethernet/sfc/siena/efx_channels.c b/drivers/net/ethernet/sfc/siena/efx_channels.c index a7346e965bfe70..d120b3c83ac07d 100644 --- a/drivers/net/ethernet/sfc/siena/efx_channels.c +++ b/drivers/net/ethernet/sfc/siena/efx_channels.c @@ -1285,7 +1285,8 @@ static int efx_poll(struct napi_struct *napi, int budget) spent = efx_process_channel(channel, budget); - xdp_do_flush(); + if (budget) + xdp_do_flush(); if (spent < budget) { if (efx_channel_has_rx_queue(channel) && From 17cbfcdd85f6c93b2e9565d61110ad0b90440436 Mon Sep 17 00:00:00 2001 From: Abhishek Chauhan Date: Tue, 1 Oct 2024 15:46:25 -0700 Subject: [PATCH 13/87] net: phy: aquantia: AQR115c fix up PMA capabilities AQR115c reports incorrect PMA capabilities which includes 10G/5G and also incorrectly disables capabilities like autoneg and 10Mbps support. AQR115c as per the Marvell databook supports speeds up to 2.5Gbps with autonegotiation. Fixes: 0ebc581f8a4b ("net: phy: aquantia: add support for aqr115c") Link: https://lore.kernel.org/all/20240913011635.1286027-1-quic_abchauha@quicinc.com/T/ Signed-off-by: Abhishek Chauhan Reviewed-by: Russell King (Oracle) Link: https://patch.msgid.link/20241001224626.2400222-2-quic_abchauha@quicinc.com Signed-off-by: Jakub Kicinski --- drivers/net/phy/aquantia/aquantia_main.c | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/drivers/net/phy/aquantia/aquantia_main.c b/drivers/net/phy/aquantia/aquantia_main.c index 4d156d406bab9a..1bb39664a5cb17 100644 --- a/drivers/net/phy/aquantia/aquantia_main.c +++ b/drivers/net/phy/aquantia/aquantia_main.c @@ -731,6 +731,19 @@ static int aqr113c_fill_interface_modes(struct phy_device *phydev) return aqr107_fill_interface_modes(phydev); } +static int aqr115c_get_features(struct phy_device *phydev) +{ + unsigned long *supported = phydev->supported; + + /* PHY supports speeds up to 2.5G with autoneg. PMA capabilities + * are not useful. + */ + linkmode_or(supported, supported, phy_gbit_features); + linkmode_set_bit(ETHTOOL_LINK_MODE_2500baseT_Full_BIT, supported); + + return 0; +} + static int aqr113c_config_init(struct phy_device *phydev) { int ret; @@ -1046,6 +1059,7 @@ static struct phy_driver aqr_driver[] = { .get_sset_count = aqr107_get_sset_count, .get_strings = aqr107_get_strings, .get_stats = aqr107_get_stats, + .get_features = aqr115c_get_features, .link_change_notify = aqr107_link_change_notify, .led_brightness_set = aqr_phy_led_brightness_set, .led_hw_is_supported = aqr_phy_led_hw_is_supported, From 8f61d73306c62e3c0e368cf6051330f4593415f6 Mon Sep 17 00:00:00 2001 From: Abhishek Chauhan Date: Tue, 1 Oct 2024 15:46:26 -0700 Subject: [PATCH 14/87] net: phy: aquantia: remove usage of phy_set_max_speed Remove the use of phy_set_max_speed in phy driver as the function is mainly used in MAC driver to set the max speed. Instead use get_features to fix up Phy PMA capabilities for AQR111, AQR111B0, AQR114C and AQCS109 Fixes: 038ba1dc4e54 ("net: phy: aquantia: add AQR111 and AQR111B0 PHY ID") Fixes: 0974f1f03b07 ("net: phy: aquantia: remove false 5G and 10G speed ability for AQCS109") Fixes: c278ec644377 ("net: phy: aquantia: add support for AQR114C PHY ID") Link: https://lore.kernel.org/all/20240913011635.1286027-1-quic_abchauha@quicinc.com/T/ Signed-off-by: Abhishek Chauhan Reviewed-by: Russell King (Oracle) Link: https://patch.msgid.link/20241001224626.2400222-3-quic_abchauha@quicinc.com Signed-off-by: Jakub Kicinski --- drivers/net/phy/aquantia/aquantia_main.c | 37 ++++++++++++------------ 1 file changed, 19 insertions(+), 18 deletions(-) diff --git a/drivers/net/phy/aquantia/aquantia_main.c b/drivers/net/phy/aquantia/aquantia_main.c index 1bb39664a5cb17..c33a5ef34ba032 100644 --- a/drivers/net/phy/aquantia/aquantia_main.c +++ b/drivers/net/phy/aquantia/aquantia_main.c @@ -537,12 +537,6 @@ static int aqcs109_config_init(struct phy_device *phydev) if (!ret) aqr107_chip_info(phydev); - /* AQCS109 belongs to a chip family partially supporting 10G and 5G. - * PMA speed ability bits are the same for all members of the family, - * AQCS109 however supports speeds up to 2.5G only. - */ - phy_set_max_speed(phydev, SPEED_2500); - return aqr107_set_downshift(phydev, MDIO_AN_VEND_PROV_DOWNSHIFT_DFLT); } @@ -744,6 +738,18 @@ static int aqr115c_get_features(struct phy_device *phydev) return 0; } +static int aqr111_get_features(struct phy_device *phydev) +{ + /* PHY supports speeds up to 5G with autoneg. PMA capabilities + * are not useful. + */ + aqr115c_get_features(phydev); + linkmode_set_bit(ETHTOOL_LINK_MODE_5000baseT_Full_BIT, + phydev->supported); + + return 0; +} + static int aqr113c_config_init(struct phy_device *phydev) { int ret; @@ -780,15 +786,6 @@ static int aqr107_probe(struct phy_device *phydev) return aqr_hwmon_probe(phydev); } -static int aqr111_config_init(struct phy_device *phydev) -{ - /* AQR111 reports supporting speed up to 10G, - * however only speeds up to 5G are supported. - */ - phy_set_max_speed(phydev, SPEED_5000); - - return aqr107_config_init(phydev); -} static struct phy_driver aqr_driver[] = { { @@ -866,6 +863,7 @@ static struct phy_driver aqr_driver[] = { .get_sset_count = aqr107_get_sset_count, .get_strings = aqr107_get_strings, .get_stats = aqr107_get_stats, + .get_features = aqr115c_get_features, .link_change_notify = aqr107_link_change_notify, .led_brightness_set = aqr_phy_led_brightness_set, .led_hw_is_supported = aqr_phy_led_hw_is_supported, @@ -878,7 +876,7 @@ static struct phy_driver aqr_driver[] = { .name = "Aquantia AQR111", .probe = aqr107_probe, .get_rate_matching = aqr107_get_rate_matching, - .config_init = aqr111_config_init, + .config_init = aqr107_config_init, .config_aneg = aqr_config_aneg, .config_intr = aqr_config_intr, .handle_interrupt = aqr_handle_interrupt, @@ -890,6 +888,7 @@ static struct phy_driver aqr_driver[] = { .get_sset_count = aqr107_get_sset_count, .get_strings = aqr107_get_strings, .get_stats = aqr107_get_stats, + .get_features = aqr111_get_features, .link_change_notify = aqr107_link_change_notify, .led_brightness_set = aqr_phy_led_brightness_set, .led_hw_is_supported = aqr_phy_led_hw_is_supported, @@ -902,7 +901,7 @@ static struct phy_driver aqr_driver[] = { .name = "Aquantia AQR111B0", .probe = aqr107_probe, .get_rate_matching = aqr107_get_rate_matching, - .config_init = aqr111_config_init, + .config_init = aqr107_config_init, .config_aneg = aqr_config_aneg, .config_intr = aqr_config_intr, .handle_interrupt = aqr_handle_interrupt, @@ -914,6 +913,7 @@ static struct phy_driver aqr_driver[] = { .get_sset_count = aqr107_get_sset_count, .get_strings = aqr107_get_strings, .get_stats = aqr107_get_stats, + .get_features = aqr111_get_features, .link_change_notify = aqr107_link_change_notify, .led_brightness_set = aqr_phy_led_brightness_set, .led_hw_is_supported = aqr_phy_led_hw_is_supported, @@ -1023,7 +1023,7 @@ static struct phy_driver aqr_driver[] = { .name = "Aquantia AQR114C", .probe = aqr107_probe, .get_rate_matching = aqr107_get_rate_matching, - .config_init = aqr111_config_init, + .config_init = aqr107_config_init, .config_aneg = aqr_config_aneg, .config_intr = aqr_config_intr, .handle_interrupt = aqr_handle_interrupt, @@ -1035,6 +1035,7 @@ static struct phy_driver aqr_driver[] = { .get_sset_count = aqr107_get_sset_count, .get_strings = aqr107_get_strings, .get_stats = aqr107_get_stats, + .get_features = aqr111_get_features, .link_change_notify = aqr107_link_change_notify, .led_brightness_set = aqr_phy_led_brightness_set, .led_hw_is_supported = aqr_phy_led_hw_is_supported, From e37ab7373696e650d3b6262a5b882aadad69bb9e Mon Sep 17 00:00:00 2001 From: Neal Cardwell Date: Tue, 1 Oct 2024 20:05:15 +0000 Subject: [PATCH 15/87] tcp: fix to allow timestamp undo if no retransmits were sent Fix the TCP loss recovery undo logic in tcp_packet_delayed() so that it can trigger undo even if TSQ prevents a fast recovery episode from reaching tcp_retransmit_skb(). Geumhwan Yu recently reported that after this commit from 2019: commit bc9f38c8328e ("tcp: avoid unconditional congestion window undo on SYN retransmit") ...and before this fix we could have buggy scenarios like the following: + Due to reordering, a TCP connection receives some SACKs and enters a spurious fast recovery. + TSQ prevents all invocations of tcp_retransmit_skb(), because many skbs are queued in lower layers of the sending machine's network stack; thus tp->retrans_stamp remains 0. + The connection receives a TCP timestamp ECR value echoing a timestamp before the fast recovery, indicating that the fast recovery was spurious. + The connection fails to undo the spurious fast recovery because tp->retrans_stamp is 0, and thus tcp_packet_delayed() returns false, due to the new logic in the 2019 commit: commit bc9f38c8328e ("tcp: avoid unconditional congestion window undo on SYN retransmit") This fix tweaks the logic to be more similar to the tcp_packet_delayed() logic before bc9f38c8328e, except that we take care not to be fooled by the FLAG_SYN_ACKED code path zeroing out tp->retrans_stamp (the bug noted and fixed by Yuchung in bc9f38c8328e). Note that this returns the high-level behavior of tcp_packet_delayed() to again match the comment for the function, which says: "Nothing was retransmitted or returned timestamp is less than timestamp of the first retransmission." Note that this comment is in the original 2005-04-16 Linux git commit, so this is evidently long-standing behavior. Fixes: bc9f38c8328e ("tcp: avoid unconditional congestion window undo on SYN retransmit") Reported-by: Geumhwan Yu Diagnosed-by: Geumhwan Yu Signed-off-by: Neal Cardwell Signed-off-by: Yuchung Cheng Signed-off-by: Eric Dumazet Link: https://patch.msgid.link/20241001200517.2756803-2-ncardwell.sw@gmail.com Signed-off-by: Jakub Kicinski --- net/ipv4/tcp_input.c | 18 ++++++++++++++++-- 1 file changed, 16 insertions(+), 2 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index cc05ec1faac89e..233b77890795cf 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -2473,8 +2473,22 @@ static bool tcp_skb_spurious_retrans(const struct tcp_sock *tp, */ static inline bool tcp_packet_delayed(const struct tcp_sock *tp) { - return tp->retrans_stamp && - tcp_tsopt_ecr_before(tp, tp->retrans_stamp); + const struct sock *sk = (const struct sock *)tp; + + if (tp->retrans_stamp && + tcp_tsopt_ecr_before(tp, tp->retrans_stamp)) + return true; /* got echoed TS before first retransmission */ + + /* Check if nothing was retransmitted (retrans_stamp==0), which may + * happen in fast recovery due to TSQ. But we ignore zero retrans_stamp + * in TCP_SYN_SENT, since when we set FLAG_SYN_ACKED we also clear + * retrans_stamp even if we had retransmitted the SYN. + */ + if (!tp->retrans_stamp && /* no record of a retransmit/SYN? */ + sk->sk_state != TCP_SYN_SENT) /* not the FLAG_SYN_ACKED case? */ + return true; /* nothing was retransmitted */ + + return false; } /* Undo procedures. */ From b41b4cbd9655bcebcce941bef3601db8110335be Mon Sep 17 00:00:00 2001 From: Neal Cardwell Date: Tue, 1 Oct 2024 20:05:16 +0000 Subject: [PATCH 16/87] tcp: fix tcp_enter_recovery() to zero retrans_stamp when it's safe Fix tcp_enter_recovery() so that if there are no retransmits out then we zero retrans_stamp when entering fast recovery. This is necessary to fix two buggy behaviors. Currently a non-zero retrans_stamp value can persist across multiple back-to-back loss recovery episodes. This is because we generally only clears retrans_stamp if we are completely done with loss recoveries, and get to tcp_try_to_open() and find !tcp_any_retrans_done(sk). This behavior causes two bugs: (1) When a loss recovery episode (CA_Loss or CA_Recovery) is followed immediately by a new CA_Recovery, the retrans_stamp value can persist and can be a time before this new CA_Recovery episode starts. That means that timestamp-based undo will be using the wrong retrans_stamp (a value that is too old) when comparing incoming TS ecr values to retrans_stamp to see if the current fast recovery episode can be undone. (2) If there is a roughly minutes-long sequence of back-to-back fast recovery episodes, one after another (e.g. in a shallow-buffered or policed bottleneck), where each fast recovery successfully makes forward progress and recovers one window of sequence space (but leaves at least one retransmit in flight at the end of the recovery), followed by several RTOs, then the ETIMEDOUT check may be using the wrong retrans_stamp (a value set at the start of the first fast recovery in the sequence). This can cause a very premature ETIMEDOUT, killing the connection prematurely. This commit changes the code to zero retrans_stamp when entering fast recovery, when this is known to be safe (no retransmits are out in the network). That ensures that when starting a fast recovery episode, and it is safe to do so, retrans_stamp is set when we send the fast retransmit packet. That addresses both bug (1) and bug (2) by ensuring that (if no retransmits are out when we start a fast recovery) we use the initial fast retransmit of this fast recovery as the time value for undo and ETIMEDOUT calculations. This makes intuitive sense, since the start of a new fast recovery episode (in a scenario where no lost packets are out in the network) means that the connection has made forward progress since the last RTO or fast recovery, and we should thus "restart the clock" used for both undo and ETIMEDOUT logic. Note that if when we start fast recovery there *are* retransmits out in the network, there can still be undesirable (1)/(2) issues. For example, after this patch we can still have the (1) and (2) problems in cases like this: + round 1: sender sends flight 1 + round 2: sender receives SACKs and enters fast recovery 1, retransmits some packets in flight 1 and then sends some new data as flight 2 + round 3: sender receives some SACKs for flight 2, notes losses, and retransmits some packets to fill the holes in flight 2 + fast recovery has some lost retransmits in flight 1 and continues for one or more rounds sending retransmits for flight 1 and flight 2 + fast recovery 1 completes when snd_una reaches high_seq at end of flight 1 + there are still holes in the SACK scoreboard in flight 2, so we enter fast recovery 2, but some retransmits in the flight 2 sequence range are still in flight (retrans_out > 0), so we can't execute the new retrans_stamp=0 added here to clear retrans_stamp It's not yet clear how to fix these remaining (1)/(2) issues in an efficient way without breaking undo behavior, given that retrans_stamp is currently used for undo and ETIMEDOUT. Perhaps the optimal (but expensive) strategy would be to set retrans_stamp to the timestamp of the earliest outstanding retransmit when entering fast recovery. But at least this commit makes things better. Note that this does not change the semantics of retrans_stamp; it simply makes retrans_stamp accurate in some cases where it was not before: (1) Some loss recovery, followed by an immediate entry into a fast recovery, where there are no retransmits out when entering the fast recovery. (2) When a TFO server has a SYNACK retransmit that sets retrans_stamp, and then the ACK that completes the 3-way handshake has SACK blocks that trigger a fast recovery. In this case when entering fast recovery we want to zero out the retrans_stamp from the TFO SYNACK retransmit, and set the retrans_stamp based on the timestamp of the fast recovery. We introduce a tcp_retrans_stamp_cleanup() helper, because this two-line sequence already appears in 3 places and is about to appear in 2 more as a result of this bug fix patch series. Once this bug fix patches series in the net branch makes it into the net-next branch we'll update the 3 other call sites to use the new helper. This is a long-standing issue. The Fixes tag below is chosen to be the oldest commit at which the patch will apply cleanly, which is from Linux v3.5 in 2012. Fixes: 1fbc340514fc ("tcp: early retransmit: tcp_enter_recovery()") Signed-off-by: Neal Cardwell Signed-off-by: Yuchung Cheng Signed-off-by: Eric Dumazet Link: https://patch.msgid.link/20241001200517.2756803-3-ncardwell.sw@gmail.com Signed-off-by: Jakub Kicinski --- net/ipv4/tcp_input.c | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 233b77890795cf..8e47907c76d5e4 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -2522,6 +2522,16 @@ static bool tcp_any_retrans_done(const struct sock *sk) return false; } +/* If loss recovery is finished and there are no retransmits out in the + * network, then we clear retrans_stamp so that upon the next loss recovery + * retransmits_timed_out() and timestamp-undo are using the correct value. + */ +static void tcp_retrans_stamp_cleanup(struct sock *sk) +{ + if (!tcp_any_retrans_done(sk)) + tcp_sk(sk)->retrans_stamp = 0; +} + static void DBGUNDO(struct sock *sk, const char *msg) { #if FASTRETRANS_DEBUG > 1 @@ -2889,6 +2899,9 @@ void tcp_enter_recovery(struct sock *sk, bool ece_ack) struct tcp_sock *tp = tcp_sk(sk); int mib_idx; + /* Start the clock with our fast retransmit, for undo and ETIMEDOUT. */ + tcp_retrans_stamp_cleanup(sk); + if (tcp_is_reno(tp)) mib_idx = LINUX_MIB_TCPRENORECOVERY; else From 27c80efcc20486c82698f05f00e288b44513c86b Mon Sep 17 00:00:00 2001 From: Neal Cardwell Date: Tue, 1 Oct 2024 20:05:17 +0000 Subject: [PATCH 17/87] tcp: fix TFO SYN_RECV to not zero retrans_stamp with retransmits out Fix tcp_rcv_synrecv_state_fastopen() to not zero retrans_stamp if retransmits are outstanding. tcp_fastopen_synack_timer() sets retrans_stamp, so typically we'll need to zero retrans_stamp here to prevent spurious retransmits_timed_out(). The logic to zero retrans_stamp is from this 2019 commit: commit cd736d8b67fb ("tcp: fix retrans timestamp on passive Fast Open") However, in the corner case where the ACK of our TFO SYNACK carried some SACK blocks that caused us to enter TCP_CA_Recovery then that non-zero retrans_stamp corresponds to the active fast recovery, and we need to leave retrans_stamp with its current non-zero value, for correct ETIMEDOUT and undo behavior. Fixes: cd736d8b67fb ("tcp: fix retrans timestamp on passive Fast Open") Signed-off-by: Neal Cardwell Signed-off-by: Yuchung Cheng Signed-off-by: Eric Dumazet Link: https://patch.msgid.link/20241001200517.2756803-4-ncardwell.sw@gmail.com Signed-off-by: Jakub Kicinski --- net/ipv4/tcp_input.c | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 8e47907c76d5e4..2d844e1f867f0a 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -6684,10 +6684,17 @@ static void tcp_rcv_synrecv_state_fastopen(struct sock *sk) if (inet_csk(sk)->icsk_ca_state == TCP_CA_Loss && !tp->packets_out) tcp_try_undo_recovery(sk); - /* Reset rtx states to prevent spurious retransmits_timed_out() */ tcp_update_rto_time(tp); - tp->retrans_stamp = 0; inet_csk(sk)->icsk_retransmits = 0; + /* In tcp_fastopen_synack_timer() on the first SYNACK RTO we set + * retrans_stamp but don't enter CA_Loss, so in case that happened we + * need to zero retrans_stamp here to prevent spurious + * retransmits_timed_out(). However, if the ACK of our SYNACK caused us + * to enter CA_Recovery then we need to leave retrans_stamp as it was + * set entering CA_Recovery, for correct retransmits_timed_out() and + * undo behavior. + */ + tcp_retrans_stamp_cleanup(sk); /* Once we leave TCP_SYN_RECV or TCP_FIN_WAIT_1, * we no longer need req so release it. From bc212465326e8587325f520a052346f0b57360e6 Mon Sep 17 00:00:00 2001 From: David Howells Date: Tue, 1 Oct 2024 14:26:58 +0100 Subject: [PATCH 18/87] rxrpc: Fix a race between socket set up and I/O thread creation In rxrpc_open_socket(), it sets up the socket and then sets up the I/O thread that will handle it. This is a problem, however, as there's a gap between the two phases in which a packet may come into rxrpc_encap_rcv() from the UDP packet but we oops when trying to wake the not-yet created I/O thread. As a quick fix, just make rxrpc_encap_rcv() discard the packet if there's no I/O thread yet. A better, but more intrusive fix would perhaps be to rearrange things such that the socket creation is done by the I/O thread. Fixes: a275da62e8c1 ("rxrpc: Create a per-local endpoint receive queue and I/O thread") Signed-off-by: David Howells cc: yuxuanzhe@outlook.com cc: Marc Dionne cc: Simon Horman cc: linux-afs@lists.infradead.org Reviewed-by: Eric Dumazet Link: https://patch.msgid.link/20241001132702.3122709-2-dhowells@redhat.com Signed-off-by: Jakub Kicinski --- net/rxrpc/ar-internal.h | 2 +- net/rxrpc/io_thread.c | 10 ++++++++-- net/rxrpc/local_object.c | 2 +- 3 files changed, 10 insertions(+), 4 deletions(-) diff --git a/net/rxrpc/ar-internal.h b/net/rxrpc/ar-internal.h index 80d682f89b2332..d0fd37bdcfe9c8 100644 --- a/net/rxrpc/ar-internal.h +++ b/net/rxrpc/ar-internal.h @@ -1056,7 +1056,7 @@ bool rxrpc_direct_abort(struct sk_buff *skb, enum rxrpc_abort_reason why, int rxrpc_io_thread(void *data); static inline void rxrpc_wake_up_io_thread(struct rxrpc_local *local) { - wake_up_process(local->io_thread); + wake_up_process(READ_ONCE(local->io_thread)); } static inline bool rxrpc_protocol_error(struct sk_buff *skb, enum rxrpc_abort_reason why) diff --git a/net/rxrpc/io_thread.c b/net/rxrpc/io_thread.c index 0300baa9afcd39..07c74c77d80214 100644 --- a/net/rxrpc/io_thread.c +++ b/net/rxrpc/io_thread.c @@ -27,11 +27,17 @@ int rxrpc_encap_rcv(struct sock *udp_sk, struct sk_buff *skb) { struct sk_buff_head *rx_queue; struct rxrpc_local *local = rcu_dereference_sk_user_data(udp_sk); + struct task_struct *io_thread; if (unlikely(!local)) { kfree_skb(skb); return 0; } + io_thread = READ_ONCE(local->io_thread); + if (!io_thread) { + kfree_skb(skb); + return 0; + } if (skb->tstamp == 0) skb->tstamp = ktime_get_real(); @@ -47,7 +53,7 @@ int rxrpc_encap_rcv(struct sock *udp_sk, struct sk_buff *skb) #endif skb_queue_tail(rx_queue, skb); - rxrpc_wake_up_io_thread(local); + wake_up_process(io_thread); return 0; } @@ -565,7 +571,7 @@ int rxrpc_io_thread(void *data) __set_current_state(TASK_RUNNING); rxrpc_see_local(local, rxrpc_local_stop); rxrpc_destroy_local(local); - local->io_thread = NULL; + WRITE_ONCE(local->io_thread, NULL); rxrpc_see_local(local, rxrpc_local_stopped); return 0; } diff --git a/net/rxrpc/local_object.c b/net/rxrpc/local_object.c index 504453c688d751..f9623ace22016f 100644 --- a/net/rxrpc/local_object.c +++ b/net/rxrpc/local_object.c @@ -232,7 +232,7 @@ static int rxrpc_open_socket(struct rxrpc_local *local, struct net *net) } wait_for_completion(&local->io_thread_ready); - local->io_thread = io_thread; + WRITE_ONCE(local->io_thread, io_thread); _leave(" = 0"); return 0; From 7a310f8d7dfe2d92a1f31ddb5357bfdd97eed273 Mon Sep 17 00:00:00 2001 From: David Howells Date: Tue, 1 Oct 2024 14:26:59 +0100 Subject: [PATCH 19/87] rxrpc: Fix uninitialised variable in rxrpc_send_data() Fix the uninitialised txb variable in rxrpc_send_data() by moving the code that loads it above all the jumps to maybe_error, txb being stored back into call->tx_pending right before the normal return. Fixes: b0f571ecd794 ("rxrpc: Fix locking in rxrpc's sendmsg") Reported-by: Dan Carpenter Closes: https://lists.infradead.org/pipermail/linux-afs/2024-October/008896.html Signed-off-by: David Howells cc: Marc Dionne cc: linux-afs@lists.infradead.org Link: https://patch.msgid.link/20241001132702.3122709-3-dhowells@redhat.com Signed-off-by: Jakub Kicinski --- net/rxrpc/sendmsg.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/net/rxrpc/sendmsg.c b/net/rxrpc/sendmsg.c index 894b8fa68e5e95..23d18fe5de9f0d 100644 --- a/net/rxrpc/sendmsg.c +++ b/net/rxrpc/sendmsg.c @@ -303,6 +303,11 @@ static int rxrpc_send_data(struct rxrpc_sock *rx, sk_clear_bit(SOCKWQ_ASYNC_NOSPACE, sk); reload: + txb = call->tx_pending; + call->tx_pending = NULL; + if (txb) + rxrpc_see_txbuf(txb, rxrpc_txbuf_see_send_more); + ret = -EPIPE; if (sk->sk_shutdown & SEND_SHUTDOWN) goto maybe_error; @@ -329,11 +334,6 @@ static int rxrpc_send_data(struct rxrpc_sock *rx, goto maybe_error; } - txb = call->tx_pending; - call->tx_pending = NULL; - if (txb) - rxrpc_see_txbuf(txb, rxrpc_txbuf_see_send_more); - do { if (!txb) { size_t remain; From 2d7a098b9dbe78b6b56694184315fbc1647719b7 Mon Sep 17 00:00:00 2001 From: Leo Stone Date: Sat, 28 Sep 2024 17:49:34 -0700 Subject: [PATCH 20/87] Documentation: networking/tcp_ao: typo and grammar fixes Fix multiple grammatical issues and add a missing period to improve readability. Signed-off-by: Leo Stone Reviewed-by: Simon Horman Link: https://patch.msgid.link/20240929005001.370991-1-leocstone@gmail.com Signed-off-by: Jakub Kicinski --- Documentation/networking/tcp_ao.rst | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/Documentation/networking/tcp_ao.rst b/Documentation/networking/tcp_ao.rst index e96e62d1dab304..d5b6d0df63c351 100644 --- a/Documentation/networking/tcp_ao.rst +++ b/Documentation/networking/tcp_ao.rst @@ -9,7 +9,7 @@ segments between trusted peers. It adds a new TCP header option with a Message Authentication Code (MAC). MACs are produced from the content of a TCP segment using a hashing function with a password known to both peers. The intent of TCP-AO is to deprecate TCP-MD5 providing better security, -key rotation and support for variety of hashing algorithms. +key rotation and support for a variety of hashing algorithms. 1. Introduction =============== @@ -164,9 +164,9 @@ A: It should not, no action needs to be performed [7.5.2.e]:: is not available, no action is required (RNextKeyID of a received segment needs to match the MKT’s SendID). -Q: How current_key is set and when does it change? It is a user-triggered -change, or is it by a request from the remote peer? Is it set by the user -explicitly, or by a matching rule? +Q: How is current_key set, and when does it change? Is it a user-triggered +change, or is it triggered by a request from the remote peer? Is it set by the +user explicitly, or by a matching rule? A: current_key is set by RNextKeyID [6.1]:: @@ -233,8 +233,8 @@ always have one current_key [3.3]:: Q: Can a non-TCP-AO connection become a TCP-AO-enabled one? -A: No: for already established non-TCP-AO connection it would be impossible -to switch using TCP-AO as the traffic key generation requires the initial +A: No: for an already established non-TCP-AO connection it would be impossible +to switch to using TCP-AO, as the traffic key generation requires the initial sequence numbers. Paraphrasing, starting using TCP-AO would require re-establishing the TCP connection. @@ -292,7 +292,7 @@ no transparency is really needed and modern BGP daemons already have Linux provides a set of ``setsockopt()s`` and ``getsockopt()s`` that let userspace manage TCP-AO on a per-socket basis. In order to add/delete MKTs -``TCP_AO_ADD_KEY`` and ``TCP_AO_DEL_KEY`` TCP socket options must be used +``TCP_AO_ADD_KEY`` and ``TCP_AO_DEL_KEY`` TCP socket options must be used. It is not allowed to add a key on an established non-TCP-AO connection as well as to remove the last key from TCP-AO connection. @@ -361,7 +361,7 @@ not implemented. 4. ``setsockopt()`` vs ``accept()`` race ======================================== -In contrast with TCP-MD5 established connection which has just one key, +In contrast with an established TCP-MD5 connection which has just one key, TCP-AO connections may have many keys, which means that accepted connections on a listen socket may have any amount of keys as well. As copying all those keys on a first properly signed SYN would make the request socket bigger, that @@ -374,7 +374,7 @@ keys from sockets that were already established, but not yet ``accept()``'ed, hanging in the accept queue. The reverse is valid as well: if userspace adds a new key for a peer on -a listener socket, the established sockets in accept queue won't +a listener socket, the established sockets in the accept queue won't have the new keys. At this moment, the resolution for the two races: @@ -382,7 +382,7 @@ At this moment, the resolution for the two races: and ``setsockopt(TCP_AO_DEL_KEY)`` vs ``accept()`` is delegated to userspace. This means that it's expected that userspace would check the MKTs on the socket that was returned by ``accept()`` to verify that any key rotation that -happened on listen socket is reflected on the newly established connection. +happened on the listen socket is reflected on the newly established connection. This is a similar "do-nothing" approach to TCP-MD5 from the kernel side and may be changed later by introducing new flags to ``tcp_ao_add`` From 1f9fc48fd302be3311186152225ef195e6139d7a Mon Sep 17 00:00:00 2001 From: Vladimir Oltean Date: Tue, 1 Oct 2024 17:02:06 +0300 Subject: [PATCH 21/87] net: dsa: sja1105: fix reception from VLAN-unaware bridges The blamed commit introduced an unexpected regression in the sja1105 driver. Packets from VLAN-unaware bridge ports get received correctly, but the protocol stack can't seem to decode them properly. For ds->untag_bridge_pvid users (thus also sja1105), the blamed commit did introduce a functional change: dsa_switch_rcv() used to call dsa_untag_bridge_pvid(), which looked like this: err = br_vlan_get_proto(br, &proto); if (err) return skb; /* Move VLAN tag from data to hwaccel */ if (!skb_vlan_tag_present(skb) && skb->protocol == htons(proto)) { skb = skb_vlan_untag(skb); if (!skb) return NULL; } and now it calls dsa_software_vlan_untag() which has just this: /* Move VLAN tag from data to hwaccel */ if (!skb_vlan_tag_present(skb)) { skb = skb_vlan_untag(skb); if (!skb) return NULL; } thus lacks any skb->protocol == bridge VLAN protocol check. That check is deferred until a later check for skb->vlan_proto (in the hwaccel area). The new code is problematic because, for VLAN-untagged packets, skb_vlan_untag() blindly takes the 4 bytes starting with the EtherType and turns them into a hwaccel VLAN tag. This is what breaks the protocol stack. It would be tempting to "make it work as before" and only call skb_vlan_untag() for those packets with the skb->protocol actually representing a VLAN. But the premise of the newly introduced dsa_software_vlan_untag() core function is not wrong. Drivers set ds->untag_bridge_pvid or ds->untag_vlan_aware_bridge_pvid presumably because they send all traffic to the CPU reception path as VLAN-tagged. So why should we spend any additional CPU cycles assuming that the packet may be VLAN-untagged? And why does the sja1105 driver opt into ds->untag_bridge_pvid if it doesn't always deliver packets to the CPU as VLAN-tagged? The answer to the latter question is indeed more interesting: it doesn't need to. This got done in commit 884be12f8566 ("net: dsa: sja1105: add support for imprecise RX"), because I thought it would be needed, but I didn't realize that it doesn't actually make a difference. As explained in the commit message of the blamed patch, ds->untag_bridge_pvid only makes a difference in the VLAN-untagged receive path of a bridge port. However, in that operating mode, tag_sja1105.c makes use of VLAN tags with the ETH_P_SJA1105 TPID, and it decodes and consumes these VLAN tags as if they were DSA tags (aka tag_8021q operation). Even if commit 884be12f8566 ("net: dsa: sja1105: add support for imprecise RX") added this logic in sja1105_bridge_vlan_add(): /* Always install bridge VLANs as egress-tagged on the CPU port. */ if (dsa_is_cpu_port(ds, port)) flags = 0; that was for _bridge_ VLANs, which are _not_ committed to hardware in VLAN-unaware mode (aka the mode where ds->untag_bridge_pvid does anything at all). Even prior to that change, the tag_8021q VLANs were always installed as egress-tagged on the CPU port, see dsa_switch_tag_8021q_vlan_add(): u16 flags = 0; // egress-tagged, non-PVID if (dsa_port_is_user(dp)) flags |= BRIDGE_VLAN_INFO_UNTAGGED | BRIDGE_VLAN_INFO_PVID; err = dsa_port_do_tag_8021q_vlan_add(dp, info->vid, flags); if (err) return err; Whether the sja1105 driver needs the new flag, ds->untag_vlan_aware_bridge_pvid, rather than ds->untag_bridge_pvid, is a separate discussion. To fix the current bug in VLAN-unaware bridge mode, I would argue that the sja1105 driver should not request something it doesn't need, rather than complicating the core DSA helper. Whereas before the blamed commit, this setting was harmless, now it has caused breakage. Fixes: 93e4649efa96 ("net: dsa: provide a software untagging function on RX for VLAN-aware bridges") Signed-off-by: Vladimir Oltean Link: https://patch.msgid.link/20241001140206.50933-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski --- drivers/net/dsa/sja1105/sja1105_main.c | 1 - 1 file changed, 1 deletion(-) diff --git a/drivers/net/dsa/sja1105/sja1105_main.c b/drivers/net/dsa/sja1105/sja1105_main.c index bc7e50dcb57c77..d0563ef59acf64 100644 --- a/drivers/net/dsa/sja1105/sja1105_main.c +++ b/drivers/net/dsa/sja1105/sja1105_main.c @@ -3158,7 +3158,6 @@ static int sja1105_setup(struct dsa_switch *ds) * TPID is ETH_P_SJA1105, and the VLAN ID is the port pvid. */ ds->vlan_filtering_is_global = true; - ds->untag_bridge_pvid = true; ds->fdb_isolation = true; ds->max_num_bridges = DSA_TAG_8021Q_MAX_NUM_BRIDGES; From f9ff7665cd128012868098bbd07e28993e314fdb Mon Sep 17 00:00:00 2001 From: Andy Roulin Date: Tue, 1 Oct 2024 08:43:59 -0700 Subject: [PATCH 22/87] netfilter: br_netfilter: fix panic with metadata_dst skb Fix a kernel panic in the br_netfilter module when sending untagged traffic via a VxLAN device. This happens during the check for fragmentation in br_nf_dev_queue_xmit. It is dependent on: 1) the br_netfilter module being loaded; 2) net.bridge.bridge-nf-call-iptables set to 1; 3) a bridge with a VxLAN (single-vxlan-device) netdevice as a bridge port; 4) untagged frames with size higher than the VxLAN MTU forwarded/flooded When forwarding the untagged packet to the VxLAN bridge port, before the netfilter hooks are called, br_handle_egress_vlan_tunnel is called and changes the skb_dst to the tunnel dst. The tunnel_dst is a metadata type of dst, i.e., skb_valid_dst(skb) is false, and metadata->dst.dev is NULL. Then in the br_netfilter hooks, in br_nf_dev_queue_xmit, there's a check for frames that needs to be fragmented: frames with higher MTU than the VxLAN device end up calling br_nf_ip_fragment, which in turns call ip_skb_dst_mtu. The ip_dst_mtu tries to use the skb_dst(skb) as if it was a valid dst with valid dst->dev, thus the crash. This case was never supported in the first place, so drop the packet instead. PING 10.0.0.2 (10.0.0.2) from 0.0.0.0 h1-eth0: 2000(2028) bytes of data. [ 176.291791] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000110 [ 176.292101] Mem abort info: [ 176.292184] ESR = 0x0000000096000004 [ 176.292322] EC = 0x25: DABT (current EL), IL = 32 bits [ 176.292530] SET = 0, FnV = 0 [ 176.292709] EA = 0, S1PTW = 0 [ 176.292862] FSC = 0x04: level 0 translation fault [ 176.293013] Data abort info: [ 176.293104] ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000 [ 176.293488] CM = 0, WnR = 0, TnD = 0, TagAccess = 0 [ 176.293787] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 [ 176.293995] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000043ef5000 [ 176.294166] [0000000000000110] pgd=0000000000000000, p4d=0000000000000000 [ 176.294827] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP [ 176.295252] Modules linked in: vxlan ip6_udp_tunnel udp_tunnel veth br_netfilter bridge stp llc ipv6 crct10dif_ce [ 176.295923] CPU: 0 PID: 188 Comm: ping Not tainted 6.8.0-rc3-g5b3fbd61b9d1 #2 [ 176.296314] Hardware name: linux,dummy-virt (DT) [ 176.296535] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 176.296808] pc : br_nf_dev_queue_xmit+0x390/0x4ec [br_netfilter] [ 176.297382] lr : br_nf_dev_queue_xmit+0x2ac/0x4ec [br_netfilter] [ 176.297636] sp : ffff800080003630 [ 176.297743] x29: ffff800080003630 x28: 0000000000000008 x27: ffff6828c49ad9f8 [ 176.298093] x26: ffff6828c49ad000 x25: 0000000000000000 x24: 00000000000003e8 [ 176.298430] x23: 0000000000000000 x22: ffff6828c4960b40 x21: ffff6828c3b16d28 [ 176.298652] x20: ffff6828c3167048 x19: ffff6828c3b16d00 x18: 0000000000000014 [ 176.298926] x17: ffffb0476322f000 x16: ffffb7e164023730 x15: 0000000095744632 [ 176.299296] x14: ffff6828c3f1c880 x13: 0000000000000002 x12: ffffb7e137926a70 [ 176.299574] x11: 0000000000000001 x10: ffff6828c3f1c898 x9 : 0000000000000000 [ 176.300049] x8 : ffff6828c49bf070 x7 : 0008460f18d5f20e x6 : f20e0100bebafeca [ 176.300302] x5 : ffff6828c7f918fe x4 : ffff6828c49bf070 x3 : 0000000000000000 [ 176.300586] x2 : 0000000000000000 x1 : ffff6828c3c7ad00 x0 : ffff6828c7f918f0 [ 176.300889] Call trace: [ 176.301123] br_nf_dev_queue_xmit+0x390/0x4ec [br_netfilter] [ 176.301411] br_nf_post_routing+0x2a8/0x3e4 [br_netfilter] [ 176.301703] nf_hook_slow+0x48/0x124 [ 176.302060] br_forward_finish+0xc8/0xe8 [bridge] [ 176.302371] br_nf_hook_thresh+0x124/0x134 [br_netfilter] [ 176.302605] br_nf_forward_finish+0x118/0x22c [br_netfilter] [ 176.302824] br_nf_forward_ip.part.0+0x264/0x290 [br_netfilter] [ 176.303136] br_nf_forward+0x2b8/0x4e0 [br_netfilter] [ 176.303359] nf_hook_slow+0x48/0x124 [ 176.303803] __br_forward+0xc4/0x194 [bridge] [ 176.304013] br_flood+0xd4/0x168 [bridge] [ 176.304300] br_handle_frame_finish+0x1d4/0x5c4 [bridge] [ 176.304536] br_nf_hook_thresh+0x124/0x134 [br_netfilter] [ 176.304978] br_nf_pre_routing_finish+0x29c/0x494 [br_netfilter] [ 176.305188] br_nf_pre_routing+0x250/0x524 [br_netfilter] [ 176.305428] br_handle_frame+0x244/0x3cc [bridge] [ 176.305695] __netif_receive_skb_core.constprop.0+0x33c/0xecc [ 176.306080] __netif_receive_skb_one_core+0x40/0x8c [ 176.306197] __netif_receive_skb+0x18/0x64 [ 176.306369] process_backlog+0x80/0x124 [ 176.306540] __napi_poll+0x38/0x17c [ 176.306636] net_rx_action+0x124/0x26c [ 176.306758] __do_softirq+0x100/0x26c [ 176.307051] ____do_softirq+0x10/0x1c [ 176.307162] call_on_irq_stack+0x24/0x4c [ 176.307289] do_softirq_own_stack+0x1c/0x2c [ 176.307396] do_softirq+0x54/0x6c [ 176.307485] __local_bh_enable_ip+0x8c/0x98 [ 176.307637] __dev_queue_xmit+0x22c/0xd28 [ 176.307775] neigh_resolve_output+0xf4/0x1a0 [ 176.308018] ip_finish_output2+0x1c8/0x628 [ 176.308137] ip_do_fragment+0x5b4/0x658 [ 176.308279] ip_fragment.constprop.0+0x48/0xec [ 176.308420] __ip_finish_output+0xa4/0x254 [ 176.308593] ip_finish_output+0x34/0x130 [ 176.308814] ip_output+0x6c/0x108 [ 176.308929] ip_send_skb+0x50/0xf0 [ 176.309095] ip_push_pending_frames+0x30/0x54 [ 176.309254] raw_sendmsg+0x758/0xaec [ 176.309568] inet_sendmsg+0x44/0x70 [ 176.309667] __sys_sendto+0x110/0x178 [ 176.309758] __arm64_sys_sendto+0x28/0x38 [ 176.309918] invoke_syscall+0x48/0x110 [ 176.310211] el0_svc_common.constprop.0+0x40/0xe0 [ 176.310353] do_el0_svc+0x1c/0x28 [ 176.310434] el0_svc+0x34/0xb4 [ 176.310551] el0t_64_sync_handler+0x120/0x12c [ 176.310690] el0t_64_sync+0x190/0x194 [ 176.311066] Code: f9402e61 79402aa2 927ff821 f9400023 (f9408860) [ 176.315743] ---[ end trace 0000000000000000 ]--- [ 176.316060] Kernel panic - not syncing: Oops: Fatal exception in interrupt [ 176.316371] Kernel Offset: 0x37e0e3000000 from 0xffff800080000000 [ 176.316564] PHYS_OFFSET: 0xffff97d780000000 [ 176.316782] CPU features: 0x0,88000203,3c020000,0100421b [ 176.317210] Memory Limit: none [ 176.317527] ---[ end Kernel panic - not syncing: Oops: Fatal Exception in interrupt ]---\ Fixes: 11538d039ac6 ("bridge: vlan dst_metadata hooks in ingress and egress paths") Reviewed-by: Ido Schimmel Signed-off-by: Andy Roulin Acked-by: Nikolay Aleksandrov Link: https://patch.msgid.link/20241001154400.22787-2-aroulin@nvidia.com Signed-off-by: Jakub Kicinski --- net/bridge/br_netfilter_hooks.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/net/bridge/br_netfilter_hooks.c b/net/bridge/br_netfilter_hooks.c index 0e8bc0ea617506..1d458e9da660c9 100644 --- a/net/bridge/br_netfilter_hooks.c +++ b/net/bridge/br_netfilter_hooks.c @@ -33,6 +33,7 @@ #include #include #include +#include #include #include #include @@ -879,6 +880,10 @@ static int br_nf_dev_queue_xmit(struct net *net, struct sock *sk, struct sk_buff return br_dev_queue_push_xmit(net, sk, skb); } + /* Fragmentation on metadata/template dst is not supported */ + if (unlikely(!skb_valid_dst(skb))) + goto drop; + /* This is wrong! We should preserve the original fragment * boundaries by preserving frag_list rather than refragmenting. */ From bc4d22b72a2d8d22b03b89083db4937dc427ddaa Mon Sep 17 00:00:00 2001 From: Andy Roulin Date: Tue, 1 Oct 2024 08:44:00 -0700 Subject: [PATCH 23/87] selftests: add regression test for br_netfilter panic Add a new netfilter selftests to test against br_netfilter panics when VxLAN single-device is used together with untagged traffic and high MTU. Reviewed-by: Petr Machata Signed-off-by: Andy Roulin Acked-by: Nikolay Aleksandrov Link: https://patch.msgid.link/20241001154400.22787-3-aroulin@nvidia.com Signed-off-by: Jakub Kicinski --- .../testing/selftests/net/netfilter/Makefile | 1 + tools/testing/selftests/net/netfilter/config | 2 + .../selftests/net/netfilter/vxlan_mtu_frag.sh | 121 ++++++++++++++++++ 3 files changed, 124 insertions(+) create mode 100755 tools/testing/selftests/net/netfilter/vxlan_mtu_frag.sh diff --git a/tools/testing/selftests/net/netfilter/Makefile b/tools/testing/selftests/net/netfilter/Makefile index e6c9e777feadc6..542f7886a0bc2a 100644 --- a/tools/testing/selftests/net/netfilter/Makefile +++ b/tools/testing/selftests/net/netfilter/Makefile @@ -31,6 +31,7 @@ TEST_PROGS += nft_tproxy_tcp.sh TEST_PROGS += nft_tproxy_udp.sh TEST_PROGS += nft_zones_many.sh TEST_PROGS += rpath.sh +TEST_PROGS += vxlan_mtu_frag.sh TEST_PROGS += xt_string.sh TEST_PROGS_EXTENDED = nft_concat_range_perf.sh diff --git a/tools/testing/selftests/net/netfilter/config b/tools/testing/selftests/net/netfilter/config index c5fe7b34eaf19a..43d8b500d391a2 100644 --- a/tools/testing/selftests/net/netfilter/config +++ b/tools/testing/selftests/net/netfilter/config @@ -7,6 +7,7 @@ CONFIG_BRIDGE_EBT_REDIRECT=m CONFIG_BRIDGE_EBT_T_FILTER=m CONFIG_BRIDGE_NETFILTER=m CONFIG_BRIDGE_NF_EBTABLES=m +CONFIG_BRIDGE_VLAN_FILTERING=y CONFIG_CGROUP_BPF=y CONFIG_DUMMY=m CONFIG_INET_ESP=m @@ -84,6 +85,7 @@ CONFIG_NFT_SYNPROXY=m CONFIG_NFT_TPROXY=m CONFIG_VETH=m CONFIG_VLAN_8021Q=m +CONFIG_VXLAN=m CONFIG_XFRM_USER=m CONFIG_XFRM_STATISTICS=y CONFIG_NET_PKTGEN=m diff --git a/tools/testing/selftests/net/netfilter/vxlan_mtu_frag.sh b/tools/testing/selftests/net/netfilter/vxlan_mtu_frag.sh new file mode 100755 index 00000000000000..912cb9583af1b5 --- /dev/null +++ b/tools/testing/selftests/net/netfilter/vxlan_mtu_frag.sh @@ -0,0 +1,121 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0 + +source lib.sh + +if ! modprobe -q -n br_netfilter 2>&1; then + echo "SKIP: Test needs br_netfilter kernel module" + exit $ksft_skip +fi + +cleanup() +{ + cleanup_all_ns +} + +trap cleanup EXIT + +setup_ns host vtep router + +create_topology() +{ + ip link add host-eth0 netns "$host" type veth peer name vtep-host netns "$vtep" + ip link add vtep-router netns "$vtep" type veth peer name router-vtep netns "$router" +} + +setup_host() +{ + # bring ports up + ip -n "$host" addr add 10.0.0.1/24 dev host-eth0 + ip -n "$host" link set host-eth0 up + + # Add VLAN 10,20 + for vid in 10 20; do + ip -n "$host" link add link host-eth0 name host-eth0.$vid type vlan id $vid + ip -n "$host" addr add 10.0.$vid.1/24 dev host-eth0.$vid + ip -n "$host" link set host-eth0.$vid up + done +} + +setup_vtep() +{ + # create bridge on vtep + ip -n "$vtep" link add name br0 type bridge + ip -n "$vtep" link set br0 type bridge vlan_filtering 1 + + # VLAN 10 is untagged PVID + ip -n "$vtep" link set dev vtep-host master br0 + bridge -n "$vtep" vlan add dev vtep-host vid 10 pvid untagged + + # VLAN 20 as other VID + ip -n "$vtep" link set dev vtep-host master br0 + bridge -n "$vtep" vlan add dev vtep-host vid 20 + + # single-vxlan device on vtep + ip -n "$vtep" address add dev vtep-router 60.0.0.1/24 + ip -n "$vtep" link add dev vxd type vxlan external \ + vnifilter local 60.0.0.1 remote 60.0.0.2 dstport 4789 ttl 64 + ip -n "$vtep" link set vxd master br0 + + # Add VLAN-VNI 1-1 mappings + bridge -n "$vtep" link set dev vxd vlan_tunnel on + for vid in 10 20; do + bridge -n "$vtep" vlan add dev vxd vid $vid + bridge -n "$vtep" vlan add dev vxd vid $vid tunnel_info id $vid + bridge -n "$vtep" vni add dev vxd vni $vid + done + + # bring ports up + ip -n "$vtep" link set vxd up + ip -n "$vtep" link set vtep-router up + ip -n "$vtep" link set vtep-host up + ip -n "$vtep" link set dev br0 up +} + +setup_router() +{ + # bring ports up + ip -n "$router" link set router-vtep up +} + +setup() +{ + modprobe -q br_netfilter + create_topology + setup_host + setup_vtep + setup_router +} + +test_large_mtu_untagged_traffic() +{ + ip -n "$vtep" link set vxd mtu 1000 + ip -n "$host" neigh add 10.0.0.2 lladdr ca:fe:ba:be:00:01 dev host-eth0 + ip netns exec "$host" \ + ping -q 10.0.0.2 -I host-eth0 -c 1 -W 0.5 -s2000 > /dev/null 2>&1 + return 0 +} + +test_large_mtu_tagged_traffic() +{ + for vid in 10 20; do + ip -n "$vtep" link set vxd mtu 1000 + ip -n "$host" neigh add 10.0.$vid.2 lladdr ca:fe:ba:be:00:01 dev host-eth0.$vid + ip netns exec "$host" \ + ping -q 10.0.$vid.2 -I host-eth0.$vid -c 1 -W 0.5 -s2000 > /dev/null 2>&1 + done + return 0 +} + +do_test() +{ + # Frames will be dropped so ping will not succeed + # If it doesn't panic, it passes + test_large_mtu_tagged_traffic + test_large_mtu_untagged_traffic +} + +setup && \ +echo "Test for VxLAN fragmentation with large MTU in br_netfilter:" && \ +do_test && echo "PASS!" +exit $? From de390657b5d6f7deb9d1d36aaf45f02ba51ec9dc Mon Sep 17 00:00:00 2001 From: Nick Child Date: Tue, 1 Oct 2024 11:32:00 -0500 Subject: [PATCH 24/87] ibmvnic: Inspect header requirements before using scrq direct Previously, the TX header requirement for standard frames was ignored. This requirement is a bitstring sent from the VIOS which maps to the type of header information needed during TX. If no header information, is needed then send subcrq direct can be used (which can be more performant). This bitstring was previously ignored for standard packets (AKA non LSO, non CSO) due to the belief that the bitstring was over-cautionary. It turns out that there are some configurations where the backing device does need header information for transmission of standard packets. If the information is not supplied then this causes continuous "Adapter error" transport events. Therefore, this bitstring should be respected and observed before considering the use of send subcrq direct. Fixes: 74839f7a8268 ("ibmvnic: Introduce send sub-crq direct") Signed-off-by: Nick Child Reviewed-by: Simon Horman Link: https://patch.msgid.link/20241001163200.1802522-2-nnac123@linux.ibm.com Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/ibm/ibmvnic.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c index 87e693a814331a..97425c06e1ed7f 100644 --- a/drivers/net/ethernet/ibm/ibmvnic.c +++ b/drivers/net/ethernet/ibm/ibmvnic.c @@ -2472,9 +2472,11 @@ static netdev_tx_t ibmvnic_xmit(struct sk_buff *skb, struct net_device *netdev) /* if we are going to send_subcrq_direct this then we need to * update the checksum before copying the data into ltb. Essentially * these packets force disable CSO so that we can guarantee that - * FW does not need header info and we can send direct. + * FW does not need header info and we can send direct. Also, vnic + * server must be able to xmit standard packets without header data */ - if (!skb_is_gso(skb) && !ind_bufp->index && !netdev_xmit_more()) { + if (*hdrs == 0 && !skb_is_gso(skb) && + !ind_bufp->index && !netdev_xmit_more()) { use_scrq_send_direct = true; if (skb->ip_summed == CHECKSUM_PARTIAL && skb_checksum_help(skb)) From 9f49d14ec41ce7be647028d7d34dea727af55272 Mon Sep 17 00:00:00 2001 From: Kacper Ludwinski Date: Wed, 2 Oct 2024 14:10:16 +0900 Subject: [PATCH 25/87] selftests: net: no_forwarding: fix VID for $swp2 in one_bridge_two_pvids() test Currently, the second bridge command overwrites the first one. Fix this by adding this VID to the interface behind $swp2. The one_bridge_two_pvids() test intends to check that there is no leakage of traffic between bridge ports which have a single VLAN - the PVID VLAN. Because of a typo, port $swp1 is configured with a PVID twice (second command overwrites first), and $swp2 isn't configured at all (and since the bridge vlan_default_pvid property is set to 0, this port will not have a PVID at all, so it will drop all untagged and priority-tagged traffic). So, instead of testing the configuration that was intended, we are testing a different one, where one port has PVID 2 and the other has no PVID. This incorrect version of the test should also pass, but is ineffective for its purpose, so fix the typo. This typo has an impact on results of the test, potentially leading to wrong conclusions regarding the functionality of a network device. The tests results: TEST: Switch ports in VLAN-aware bridge with different PVIDs: Unicast non-IP untagged [ OK ] Multicast non-IP untagged [ OK ] Broadcast non-IP untagged [ OK ] Unicast IPv4 untagged [ OK ] Multicast IPv4 untagged [ OK ] Unicast IPv6 untagged [ OK ] Multicast IPv6 untagged [ OK ] Unicast non-IP VID 1 [ OK ] Multicast non-IP VID 1 [ OK ] Broadcast non-IP VID 1 [ OK ] Unicast IPv4 VID 1 [ OK ] Multicast IPv4 VID 1 [ OK ] Unicast IPv6 VID 1 [ OK ] Multicast IPv6 VID 1 [ OK ] Unicast non-IP VID 4094 [ OK ] Multicast non-IP VID 4094 [ OK ] Broadcast non-IP VID 4094 [ OK ] Unicast IPv4 VID 4094 [ OK ] Multicast IPv4 VID 4094 [ OK ] Unicast IPv6 VID 4094 [ OK ] Multicast IPv6 VID 4094 [ OK ] Fixes: 476a4f05d9b8 ("selftests: forwarding: add a no_forwarding.sh test") Reviewed-by: Hangbin Liu Reviewed-by: Shuah Khan Signed-off-by: Kacper Ludwinski Link: https://patch.msgid.link/20241002051016.849-1-kac.ludwinski@icloud.com Signed-off-by: Jakub Kicinski --- tools/testing/selftests/net/forwarding/no_forwarding.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/testing/selftests/net/forwarding/no_forwarding.sh b/tools/testing/selftests/net/forwarding/no_forwarding.sh index 9e677aa64a06a6..694ece9ba3a742 100755 --- a/tools/testing/selftests/net/forwarding/no_forwarding.sh +++ b/tools/testing/selftests/net/forwarding/no_forwarding.sh @@ -202,7 +202,7 @@ one_bridge_two_pvids() ip link set $swp2 master br0 bridge vlan add dev $swp1 vid 1 pvid untagged - bridge vlan add dev $swp1 vid 2 pvid untagged + bridge vlan add dev $swp2 vid 2 pvid untagged run_test "Switch ports in VLAN-aware bridge with different PVIDs" From dda3529d2e84e2ee7b97158c9cdf5e10308f37bc Mon Sep 17 00:00:00 2001 From: Kory Maincent Date: Wed, 2 Oct 2024 14:17:05 +0200 Subject: [PATCH 26/87] net: pse-pd: Fix enabled status mismatch PSE controllers like the TPS23881 can forcefully turn off their configuration state. In such cases, the is_enabled() and get_status() callbacks will report the PSE as disabled, while admin_state_enabled will show it as enabled. This mismatch can lead the user to attempt to enable it, but no action is taken as admin_state_enabled remains set. The solution is to disable the PSE before enabling it, ensuring the actual status matches admin_state_enabled. Fixes: d83e13761d5b ("net: pse-pd: Use regulator framework within PSE framework") Signed-off-by: Kory Maincent Reviewed-by: Andrew Lunn Link: https://patch.msgid.link/20241002121706.246143-1-kory.maincent@bootlin.com Signed-off-by: Jakub Kicinski --- drivers/net/pse-pd/pse_core.c | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/drivers/net/pse-pd/pse_core.c b/drivers/net/pse-pd/pse_core.c index 4f032b16a8a0a6..f8e6854781e6ee 100644 --- a/drivers/net/pse-pd/pse_core.c +++ b/drivers/net/pse-pd/pse_core.c @@ -785,6 +785,17 @@ static int pse_ethtool_c33_set_config(struct pse_control *psec, */ switch (config->c33_admin_control) { case ETHTOOL_C33_PSE_ADMIN_STATE_ENABLED: + /* We could have mismatch between admin_state_enabled and + * state reported by regulator_is_enabled. This can occur when + * the PI is forcibly turn off by the controller. Call + * regulator_disable on that case to fix the counters state. + */ + if (psec->pcdev->pi[psec->id].admin_state_enabled && + !regulator_is_enabled(psec->ps)) { + err = regulator_disable(psec->ps); + if (err) + break; + } if (!psec->pcdev->pi[psec->id].admin_state_enabled) err = regulator_enable(psec->ps); break; From 08d1914293dae38350b8088980e59fbc699a72fe Mon Sep 17 00:00:00 2001 From: Luiz Augusto von Dentz Date: Mon, 30 Sep 2024 13:26:21 -0400 Subject: [PATCH 27/87] Bluetooth: RFCOMM: FIX possible deadlock in rfcomm_sk_state_change rfcomm_sk_state_change attempts to use sock_lock so it must never be called with it locked but rfcomm_sock_ioctl always attempt to lock it causing the following trace: ====================================================== WARNING: possible circular locking dependency detected 6.8.0-syzkaller-08951-gfe46a7dd189e #0 Not tainted ------------------------------------------------------ syz-executor386/5093 is trying to acquire lock: ffff88807c396258 (sk_lock-AF_BLUETOOTH-BTPROTO_RFCOMM){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1671 [inline] ffff88807c396258 (sk_lock-AF_BLUETOOTH-BTPROTO_RFCOMM){+.+.}-{0:0}, at: rfcomm_sk_state_change+0x5b/0x310 net/bluetooth/rfcomm/sock.c:73 but task is already holding lock: ffff88807badfd28 (&d->lock){+.+.}-{3:3}, at: __rfcomm_dlc_close+0x226/0x6a0 net/bluetooth/rfcomm/core.c:491 Reported-by: syzbot+d7ce59b06b3eb14fd218@syzkaller.appspotmail.com Tested-by: syzbot+d7ce59b06b3eb14fd218@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=d7ce59b06b3eb14fd218 Fixes: 3241ad820dbb ("[Bluetooth] Add timestamp support to L2CAP, RFCOMM and SCO") Signed-off-by: Luiz Augusto von Dentz --- net/bluetooth/rfcomm/sock.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/net/bluetooth/rfcomm/sock.c b/net/bluetooth/rfcomm/sock.c index 37d63d768afb8c..f48250e3f2e103 100644 --- a/net/bluetooth/rfcomm/sock.c +++ b/net/bluetooth/rfcomm/sock.c @@ -865,9 +865,7 @@ static int rfcomm_sock_ioctl(struct socket *sock, unsigned int cmd, unsigned lon if (err == -ENOIOCTLCMD) { #ifdef CONFIG_BT_RFCOMM_TTY - lock_sock(sk); err = rfcomm_dev_ioctl(sk, cmd, (void __user *) arg); - release_sock(sk); #else err = -EOPNOTSUPP; #endif From 18fd04ad856df07733f5bb07e7f7168e7443d393 Mon Sep 17 00:00:00 2001 From: Luiz Augusto von Dentz Date: Wed, 2 Oct 2024 11:17:26 -0400 Subject: [PATCH 28/87] Bluetooth: hci_conn: Fix UAF in hci_enhanced_setup_sync This checks if the ACL connection remains valid as it could be destroyed while hci_enhanced_setup_sync is pending on cmd_sync leading to the following trace: BUG: KASAN: slab-use-after-free in hci_enhanced_setup_sync+0x91b/0xa60 Read of size 1 at addr ffff888002328ffd by task kworker/u5:2/37 CPU: 0 UID: 0 PID: 37 Comm: kworker/u5:2 Not tainted 6.11.0-rc6-01300-g810be445d8d6 #7099 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-2.fc40 04/01/2014 Workqueue: hci0 hci_cmd_sync_work Call Trace: dump_stack_lvl+0x5d/0x80 ? hci_enhanced_setup_sync+0x91b/0xa60 print_report+0x152/0x4c0 ? hci_enhanced_setup_sync+0x91b/0xa60 ? __virt_addr_valid+0x1fa/0x420 ? hci_enhanced_setup_sync+0x91b/0xa60 kasan_report+0xda/0x1b0 ? hci_enhanced_setup_sync+0x91b/0xa60 hci_enhanced_setup_sync+0x91b/0xa60 ? __pfx_hci_enhanced_setup_sync+0x10/0x10 ? __pfx___mutex_lock+0x10/0x10 hci_cmd_sync_work+0x1c2/0x330 process_one_work+0x7d9/0x1360 ? __pfx_lock_acquire+0x10/0x10 ? __pfx_process_one_work+0x10/0x10 ? assign_work+0x167/0x240 worker_thread+0x5b7/0xf60 ? __kthread_parkme+0xac/0x1c0 ? __pfx_worker_thread+0x10/0x10 ? __pfx_worker_thread+0x10/0x10 kthread+0x293/0x360 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x2f/0x70 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 Allocated by task 34: kasan_save_stack+0x30/0x50 kasan_save_track+0x14/0x30 __kasan_kmalloc+0x8f/0xa0 __hci_conn_add+0x187/0x17d0 hci_connect_sco+0x2e1/0xb90 sco_sock_connect+0x2a2/0xb80 __sys_connect+0x227/0x2a0 __x64_sys_connect+0x6d/0xb0 do_syscall_64+0x71/0x140 entry_SYSCALL_64_after_hwframe+0x76/0x7e Freed by task 37: kasan_save_stack+0x30/0x50 kasan_save_track+0x14/0x30 kasan_save_free_info+0x3b/0x60 __kasan_slab_free+0x101/0x160 kfree+0xd0/0x250 device_release+0x9a/0x210 kobject_put+0x151/0x280 hci_conn_del+0x448/0xbf0 hci_abort_conn_sync+0x46f/0x980 hci_cmd_sync_work+0x1c2/0x330 process_one_work+0x7d9/0x1360 worker_thread+0x5b7/0xf60 kthread+0x293/0x360 ret_from_fork+0x2f/0x70 ret_from_fork_asm+0x1a/0x30 Cc: stable@vger.kernel.org Fixes: e07a06b4eb41 ("Bluetooth: Convert SCO configure_datapath to hci_sync") Signed-off-by: Luiz Augusto von Dentz --- net/bluetooth/hci_conn.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/net/bluetooth/hci_conn.c b/net/bluetooth/hci_conn.c index d083117ee36c3b..c4c74b82ed211c 100644 --- a/net/bluetooth/hci_conn.c +++ b/net/bluetooth/hci_conn.c @@ -289,6 +289,9 @@ static int hci_enhanced_setup_sync(struct hci_dev *hdev, void *data) kfree(conn_handle); + if (!hci_conn_valid(hdev, conn)) + return -ECANCELED; + bt_dev_dbg(hdev, "hcon %p", conn); configure_datapath_sync(hdev, &conn->codec); From 610712298b11b2914be00b35abe9326b5dbb62c8 Mon Sep 17 00:00:00 2001 From: Luiz Augusto von Dentz Date: Tue, 1 Oct 2024 11:21:37 -0400 Subject: [PATCH 29/87] Bluetooth: btusb: Don't fail external suspend requests Commit 4e0a1d8b0675 ("Bluetooth: btusb: Don't suspend when there are connections") introduces a check for connections to prevent auto-suspend but that actually ignored the fact the .suspend callback can be called for external suspend requests which Documentation/driver-api/usb/power-management.rst states the following: 'External suspend calls should never be allowed to fail in this way, only autosuspend calls. The driver can tell them apart by applying the :c:func:`PMSG_IS_AUTO` macro to the message argument to the ``suspend`` method; it will return True for internal PM events (autosuspend) and False for external PM events.' In addition to that align system suspend with USB suspend by using hci_suspend_dev since otherwise the stack would be expecting events such as advertising reports which may not be delivered while the transport is suspended. Fixes: 4e0a1d8b0675 ("Bluetooth: btusb: Don't suspend when there are connections") Signed-off-by: Luiz Augusto von Dentz Tested-by: Kiran K --- drivers/bluetooth/btusb.c | 20 ++++++++++++++++++-- 1 file changed, 18 insertions(+), 2 deletions(-) diff --git a/drivers/bluetooth/btusb.c b/drivers/bluetooth/btusb.c index f23c8801ad5cc4..a3e45b3060d1de 100644 --- a/drivers/bluetooth/btusb.c +++ b/drivers/bluetooth/btusb.c @@ -4038,16 +4038,29 @@ static void btusb_disconnect(struct usb_interface *intf) static int btusb_suspend(struct usb_interface *intf, pm_message_t message) { struct btusb_data *data = usb_get_intfdata(intf); + int err; BT_DBG("intf %p", intf); - /* Don't suspend if there are connections */ - if (hci_conn_count(data->hdev)) + /* Don't auto-suspend if there are connections; external suspend calls + * shall never fail. + */ + if (PMSG_IS_AUTO(message) && hci_conn_count(data->hdev)) return -EBUSY; if (data->suspend_count++) return 0; + /* Notify Host stack to suspend; this has to be done before stopping + * the traffic since the hci_suspend_dev itself may generate some + * traffic. + */ + err = hci_suspend_dev(data->hdev); + if (err) { + data->suspend_count--; + return err; + } + spin_lock_irq(&data->txlock); if (!(PMSG_IS_AUTO(message) && data->tx_in_flight)) { set_bit(BTUSB_SUSPENDING, &data->flags); @@ -4055,6 +4068,7 @@ static int btusb_suspend(struct usb_interface *intf, pm_message_t message) } else { spin_unlock_irq(&data->txlock); data->suspend_count--; + hci_resume_dev(data->hdev); return -EBUSY; } @@ -4175,6 +4189,8 @@ static int btusb_resume(struct usb_interface *intf) spin_unlock_irq(&data->txlock); schedule_work(&data->work); + hci_resume_dev(data->hdev); + return 0; failed: From 1dae9f1187189bc09ff6d25ca97ead711f7e26f9 Mon Sep 17 00:00:00 2001 From: Anastasia Kovaleva Date: Thu, 3 Oct 2024 13:44:31 +0300 Subject: [PATCH 30/87] net: Fix an unsafe loop on the list The kernel may crash when deleting a genetlink family if there are still listeners for that family: Oops: Kernel access of bad area, sig: 11 [#1] ... NIP [c000000000c080bc] netlink_update_socket_mc+0x3c/0xc0 LR [c000000000c0f764] __netlink_clear_multicast_users+0x74/0xc0 Call Trace: __netlink_clear_multicast_users+0x74/0xc0 genl_unregister_family+0xd4/0x2d0 Change the unsafe loop on the list to a safe one, because inside the loop there is an element removal from this list. Fixes: b8273570f802 ("genetlink: fix netns vs. netlink table locking (2)") Cc: stable@vger.kernel.org Signed-off-by: Anastasia Kovaleva Reviewed-by: Dmitry Bogdanov Reviewed-by: Kuniyuki Iwashima Link: https://patch.msgid.link/20241003104431.12391-1-a.kovaleva@yadro.com Signed-off-by: Jakub Kicinski --- include/net/sock.h | 2 ++ net/netlink/af_netlink.c | 3 ++- 2 files changed, 4 insertions(+), 1 deletion(-) diff --git a/include/net/sock.h b/include/net/sock.h index c58ca8dd561b73..db29c39e19a737 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -894,6 +894,8 @@ static inline void sk_add_bind_node(struct sock *sk, hlist_for_each_entry_safe(__sk, tmp, list, sk_node) #define sk_for_each_bound(__sk, list) \ hlist_for_each_entry(__sk, list, sk_bind_node) +#define sk_for_each_bound_safe(__sk, tmp, list) \ + hlist_for_each_entry_safe(__sk, tmp, list, sk_bind_node) /** * sk_for_each_entry_offset_rcu - iterate over a list at a given struct offset diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c index 0b7a89db3ab74e..0a9287fadb47a2 100644 --- a/net/netlink/af_netlink.c +++ b/net/netlink/af_netlink.c @@ -2136,8 +2136,9 @@ void __netlink_clear_multicast_users(struct sock *ksk, unsigned int group) { struct sock *sk; struct netlink_table *tbl = &nl_table[ksk->sk_protocol]; + struct hlist_node *tmp; - sk_for_each_bound(sk, &tbl->mc_list) + sk_for_each_bound_safe(sk, tmp, &tbl->mc_list) netlink_update_socket_mc(nlk_sk(sk), group, 0); } From 9234a2549cb6ac038bec36cc7c084218e9575513 Mon Sep 17 00:00:00 2001 From: Christophe JAILLET Date: Thu, 3 Oct 2024 21:03:21 +0200 Subject: [PATCH 31/87] net: phy: bcm84881: Fix some error handling paths If phy_read_mmd() fails, the error code stored in 'bmsr' should be returned instead of 'val' which is likely to be 0. Fixes: 75f4d8d10e01 ("net: phy: add Broadcom BCM84881 PHY driver") Signed-off-by: Christophe JAILLET Link: https://patch.msgid.link/3e1755b0c40340d00e089d6adae5bca2f8c79e53.1727982168.git.christophe.jaillet@wanadoo.fr Signed-off-by: Jakub Kicinski --- drivers/net/phy/bcm84881.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/phy/bcm84881.c b/drivers/net/phy/bcm84881.c index f1d47c2640585b..97da3aee49422c 100644 --- a/drivers/net/phy/bcm84881.c +++ b/drivers/net/phy/bcm84881.c @@ -132,7 +132,7 @@ static int bcm84881_aneg_done(struct phy_device *phydev) bmsr = phy_read_mmd(phydev, MDIO_MMD_AN, MDIO_AN_C22 + MII_BMSR); if (bmsr < 0) - return val; + return bmsr; return !!(val & MDIO_AN_STAT1_COMPLETE) && !!(bmsr & BMSR_ANEGCOMPLETE); @@ -158,7 +158,7 @@ static int bcm84881_read_status(struct phy_device *phydev) bmsr = phy_read_mmd(phydev, MDIO_MMD_AN, MDIO_AN_C22 + MII_BMSR); if (bmsr < 0) - return val; + return bmsr; phydev->autoneg_complete = !!(val & MDIO_AN_STAT1_COMPLETE) && !!(bmsr & BMSR_ANEGCOMPLETE); From 631083143315d1b192bd7d915b967b37819e88ea Mon Sep 17 00:00:00 2001 From: Ignat Korchagin Date: Thu, 3 Oct 2024 18:01:51 +0100 Subject: [PATCH 32/87] net: explicitly clear the sk pointer, when pf->create fails We have recently noticed the exact same KASAN splat as in commit 6cd4a78d962b ("net: do not leave a dangling sk pointer, when socket creation fails"). The problem is that commit did not fully address the problem, as some pf->create implementations do not use sk_common_release in their error paths. For example, we can use the same reproducer as in the above commit, but changing ping to arping. arping uses AF_PACKET socket and if packet_create fails, it will just sk_free the allocated sk object. While we could chase all the pf->create implementations and make sure they NULL the freed sk object on error from the socket, we can't guarantee future protocols will not make the same mistake. So it is easier to just explicitly NULL the sk pointer upon return from pf->create in __sock_create. We do know that pf->create always releases the allocated sk object on error, so if the pointer is not NULL, it is definitely dangling. Fixes: 6cd4a78d962b ("net: do not leave a dangling sk pointer, when socket creation fails") Signed-off-by: Ignat Korchagin Cc: stable@vger.kernel.org Reviewed-by: Eric Dumazet Link: https://patch.msgid.link/20241003170151.69445-1-ignat@cloudflare.com Signed-off-by: Jakub Kicinski --- net/socket.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/net/socket.c b/net/socket.c index 601ad74930efb0..042451f01c6520 100644 --- a/net/socket.c +++ b/net/socket.c @@ -1574,8 +1574,13 @@ int __sock_create(struct net *net, int family, int type, int protocol, rcu_read_unlock(); err = pf->create(net, sock, protocol, kern); - if (err < 0) + if (err < 0) { + /* ->create should release the allocated sock->sk object on error + * but it may leave the dangling pointer + */ + sock->sk = NULL; goto out_module_put; + } /* * Now to bump the refcnt of the [loadable] module that owns this From 5c14e51d2d7df49fe0d4e64a12c58d2542f452ff Mon Sep 17 00:00:00 2001 From: Anatolij Gustschin Date: Fri, 4 Oct 2024 13:36:54 +0200 Subject: [PATCH 33/87] net: dsa: lan9303: ensure chip reset and wait for READY status Accessing device registers seems to be not reliable, the chip revision is sometimes detected wrongly (0 instead of expected 1). Ensure that the chip reset is performed via reset GPIO and then wait for 'Device Ready' status in HW_CFG register before doing any register initializations. Cc: stable@vger.kernel.org Fixes: a1292595e006 ("net: dsa: add new DSA switch driver for the SMSC-LAN9303") Signed-off-by: Anatolij Gustschin [alex: reworked using read_poll_timeout()] Signed-off-by: Alexander Sverdlin Reviewed-by: Vladimir Oltean Link: https://patch.msgid.link/20241004113655.3436296-1-alexander.sverdlin@siemens.com Signed-off-by: Jakub Kicinski --- drivers/net/dsa/lan9303-core.c | 29 +++++++++++++++++++++++++++++ 1 file changed, 29 insertions(+) diff --git a/drivers/net/dsa/lan9303-core.c b/drivers/net/dsa/lan9303-core.c index 268949939636a0..d246f95d57ecf8 100644 --- a/drivers/net/dsa/lan9303-core.c +++ b/drivers/net/dsa/lan9303-core.c @@ -6,6 +6,7 @@ #include #include #include +#include #include #include #include @@ -839,6 +840,8 @@ static void lan9303_handle_reset(struct lan9303 *chip) if (!chip->reset_gpio) return; + gpiod_set_value_cansleep(chip->reset_gpio, 1); + if (chip->reset_duration != 0) msleep(chip->reset_duration); @@ -864,8 +867,34 @@ static int lan9303_disable_processing(struct lan9303 *chip) static int lan9303_check_device(struct lan9303 *chip) { int ret; + int err; u32 reg; + /* In I2C-managed configurations this polling loop will clash with + * switch's reading of EEPROM right after reset and this behaviour is + * not configurable. While lan9303_read() already has quite long retry + * timeout, seems not all cases are being detected as arbitration error. + * + * According to datasheet, EEPROM loader has 30ms timeout (in case of + * missing EEPROM). + * + * Loading of the largest supported EEPROM is expected to take at least + * 5.9s. + */ + err = read_poll_timeout(lan9303_read, ret, + !ret && reg & LAN9303_HW_CFG_READY, + 20000, 6000000, false, + chip->regmap, LAN9303_HW_CFG, ®); + if (ret) { + dev_err(chip->dev, "failed to read HW_CFG reg: %pe\n", + ERR_PTR(ret)); + return ret; + } + if (err) { + dev_err(chip->dev, "HW_CFG not ready: 0x%08x\n", reg); + return err; + } + ret = lan9303_read(chip->regmap, LAN9303_CHIP_REV, ®); if (ret) { dev_err(chip->dev, "failed to read chip revision register: %d\n", From 5546da79e6cc5bb3324bf25688ed05498fd3f86d Mon Sep 17 00:00:00 2001 From: Jakub Kicinski Date: Fri, 4 Oct 2024 07:21:15 -0700 Subject: [PATCH 34/87] Revert "net: stmmac: set PP_FLAG_DMA_SYNC_DEV only if XDP is enabled" This reverts commit b514c47ebf41a6536551ed28a05758036e6eca7c. The commit describes that we don't have to sync the page when recycling, and it tries to optimize that case. But we do need to sync after allocation. Recycling side should be changed to pass the right sync size instead. Fixes: b514c47ebf41 ("net: stmmac: set PP_FLAG_DMA_SYNC_DEV only if XDP is enabled") Reported-by: Jon Hunter Link: https://lore.kernel.org/20241004070846.2502e9ea@kernel.org Reviewed-by: Jacob Keller Reviewed-by: Furong Xu <0x1207@gmail.com> Link: https://patch.msgid.link/20241004142115.910876-1-kuba@kernel.org Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c index e2140482270a88..d3895d7eecfc52 100644 --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c @@ -2035,7 +2035,7 @@ static int __alloc_dma_rx_desc_resources(struct stmmac_priv *priv, rx_q->queue_index = queue; rx_q->priv_data = priv; - pp_params.flags = PP_FLAG_DMA_MAP | (xdp_prog ? PP_FLAG_DMA_SYNC_DEV : 0); + pp_params.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV; pp_params.pool_size = dma_conf->dma_rx_size; num_pages = DIV_ROUND_UP(dma_conf->dma_buf_sz, PAGE_SIZE); pp_params.order = ilog2(num_pages); From 83211ae1640516accae645de82f5a0a142676897 Mon Sep 17 00:00:00 2001 From: Christophe JAILLET Date: Thu, 3 Oct 2024 20:53:15 +0200 Subject: [PATCH 35/87] net: ethernet: adi: adin1110: Fix some error handling path in adin1110_read_fifo() If 'frame_size' is too small or if 'round_len' is an error code, it is likely that an error code should be returned to the caller. Actually, 'ret' is likely to be 0, so if one of these sanity checks fails, 'success' is returned. Return -EINVAL instead. Fixes: bc93e19d088b ("net: ethernet: adi: Add ADIN1110 support") Signed-off-by: Christophe JAILLET Link: https://patch.msgid.link/8ff73b40f50d8fa994a454911b66adebce8da266.1727981562.git.christophe.jaillet@wanadoo.fr Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/adi/adin1110.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/adi/adin1110.c b/drivers/net/ethernet/adi/adin1110.c index a98b3139606ade..68fad5575fd4f8 100644 --- a/drivers/net/ethernet/adi/adin1110.c +++ b/drivers/net/ethernet/adi/adin1110.c @@ -318,11 +318,11 @@ static int adin1110_read_fifo(struct adin1110_port_priv *port_priv) * from the ADIN1110 frame header. */ if (frame_size < ADIN1110_FRAME_HEADER_LEN + ADIN1110_FEC_LEN) - return ret; + return -EINVAL; round_len = adin1110_round_len(frame_size); if (round_len < 0) - return ret; + return -EINVAL; frame_size_no_fcs = frame_size - ADIN1110_FRAME_HEADER_LEN - ADIN1110_FEC_LEN; memset(priv->data, 0, ADIN1110_RD_HEADER_LEN); From f50b5d74c68e551667e265123659b187a30fe3a5 Mon Sep 17 00:00:00 2001 From: Christian Marangi Date: Fri, 4 Oct 2024 20:27:58 +0200 Subject: [PATCH 36/87] net: phy: Remove LED entry from LEDs list on unregister Commit c938ab4da0eb ("net: phy: Manual remove LEDs to ensure correct ordering") correctly fixed a problem with using devm_ but missed removing the LED entry from the LEDs list. This cause kernel panic on specific scenario where the port for the PHY is torn down and up and the kmod for the PHY is removed. On setting the port down the first time, the assosiacted LEDs are correctly unregistered. The associated kmod for the PHY is now removed. The kmod is now added again and the port is now put up, the associated LED are registered again. On putting the port down again for the second time after these step, the LED list now have 4 elements. With the first 2 already unregistered previously and the 2 new one registered again. This cause a kernel panic as the first 2 element should have been removed. Fix this by correctly removing the element when LED is unregistered. Reported-by: Daniel Golle Tested-by: Daniel Golle Cc: stable@vger.kernel.org Fixes: c938ab4da0eb ("net: phy: Manual remove LEDs to ensure correct ordering") Signed-off-by: Christian Marangi Reviewed-by: Andrew Lunn Link: https://patch.msgid.link/20241004182759.14032-1-ansuelsmth@gmail.com Signed-off-by: Jakub Kicinski --- drivers/net/phy/phy_device.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c index 560e338b307a40..499797646580e3 100644 --- a/drivers/net/phy/phy_device.c +++ b/drivers/net/phy/phy_device.c @@ -3326,10 +3326,11 @@ static __maybe_unused int phy_led_hw_is_supported(struct led_classdev *led_cdev, static void phy_leds_unregister(struct phy_device *phydev) { - struct phy_led *phyled; + struct phy_led *phyled, *tmp; - list_for_each_entry(phyled, &phydev->leds, list) { + list_for_each_entry_safe(phyled, tmp, &phydev->leds, list) { led_classdev_unregister(&phyled->led_cdev); + list_del(&phyled->list); } } From 3dc6e998d18bfba6e0dc979d3cc68eba98dfeef7 Mon Sep 17 00:00:00 2001 From: Lorenzo Bianconi Date: Fri, 4 Oct 2024 15:51:26 +0200 Subject: [PATCH 37/87] net: airoha: Update tx cpu dma ring idx at the end of xmit loop Move the tx cpu dma ring index update out of transmit loop of airoha_dev_xmit routine in order to not start transmitting the packet before it is fully DMA mapped (e.g. fragmented skbs). Fixes: 23020f049327 ("net: airoha: Introduce ethernet support for EN7581 SoC") Reported-by: Felix Fietkau Signed-off-by: Lorenzo Bianconi Reviewed-by: Simon Horman Link: https://patch.msgid.link/20241004-airoha-eth-7581-mapping-fix-v1-1-8e4279ab1812@kernel.org Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/mediatek/airoha_eth.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/drivers/net/ethernet/mediatek/airoha_eth.c b/drivers/net/ethernet/mediatek/airoha_eth.c index 930f180688e5ca..2c26eb18528372 100644 --- a/drivers/net/ethernet/mediatek/airoha_eth.c +++ b/drivers/net/ethernet/mediatek/airoha_eth.c @@ -2471,10 +2471,6 @@ static netdev_tx_t airoha_dev_xmit(struct sk_buff *skb, e->dma_addr = addr; e->dma_len = len; - airoha_qdma_rmw(qdma, REG_TX_CPU_IDX(qid), - TX_RING_CPU_IDX_MASK, - FIELD_PREP(TX_RING_CPU_IDX_MASK, index)); - data = skb_frag_address(frag); len = skb_frag_size(frag); } @@ -2483,6 +2479,11 @@ static netdev_tx_t airoha_dev_xmit(struct sk_buff *skb, q->queued += i; skb_tx_timestamp(skb); + if (!netdev_xmit_more()) + airoha_qdma_rmw(qdma, REG_TX_CPU_IDX(qid), + TX_RING_CPU_IDX_MASK, + FIELD_PREP(TX_RING_CPU_IDX_MASK, q->head)); + if (q->ndesc - q->queued < q->free_thr) netif_tx_stop_queue(txq); From 47f9605484a89ea14c41f0aa0e9294b7b94d64c0 Mon Sep 17 00:00:00 2001 From: Nicolas Pitre Date: Fri, 4 Oct 2024 00:10:33 -0400 Subject: [PATCH 38/87] net: ethernet: ti: am65-cpsw: prevent WARN_ON upon module removal In am65_cpsw_nuss_remove(), move the call to am65_cpsw_unregister_devlink() after am65_cpsw_nuss_cleanup_ndev() to avoid triggering the WARN_ON(devlink_port->type != DEVLINK_PORT_TYPE_NOTSET) in devl_port_unregister(). Makes it coherent with usage in m65_cpsw_nuss_register_ndevs()'s cleanup path. Fixes: 58356eb31d60 ("net: ti: am65-cpsw-nuss: Add devlink support") Signed-off-by: Nicolas Pitre Reviewed-by: Roger Quadros Signed-off-by: Paolo Abeni --- drivers/net/ethernet/ti/am65-cpsw-nuss.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/ti/am65-cpsw-nuss.c b/drivers/net/ethernet/ti/am65-cpsw-nuss.c index d253727b160f5f..3fec2cd45f5ef8 100644 --- a/drivers/net/ethernet/ti/am65-cpsw-nuss.c +++ b/drivers/net/ethernet/ti/am65-cpsw-nuss.c @@ -3652,13 +3652,13 @@ static void am65_cpsw_nuss_remove(struct platform_device *pdev) return; } - am65_cpsw_unregister_devlink(common); am65_cpsw_unregister_notifiers(common); /* must unregister ndevs here because DD release_driver routine calls * dma_deconfigure(dev) before devres_release_all(dev) */ am65_cpsw_nuss_cleanup_ndev(common); + am65_cpsw_unregister_devlink(common); am65_cpsw_nuss_phylink_cleanup(common); am65_cpts_release(common->cpts); am65_cpsw_disable_serdes_phy(common); From 03c96bc9d3d2d5991ed455d70a67cbafbbc50063 Mon Sep 17 00:00:00 2001 From: Nicolas Pitre Date: Fri, 4 Oct 2024 00:10:34 -0400 Subject: [PATCH 39/87] net: ethernet: ti: am65-cpsw: avoid devm_alloc_etherdev, fix module removal Usage of devm_alloc_etherdev_mqs() conflicts with am65_cpsw_nuss_cleanup_ndev() as the same struct net_device instances get unregistered twice. Switch to alloc_etherdev_mqs() and make sure am65_cpsw_nuss_cleanup_ndev() unregisters and frees those net_device instances properly. With this, it is finally possible to rmmod the driver without oopsing the kernel. Fixes: 93a76530316a ("net: ethernet: ti: introduce am65x/j721e gigabit eth subsystem driver") Signed-off-by: Nicolas Pitre Reviewed-by: Roger Quadros Signed-off-by: Paolo Abeni --- drivers/net/ethernet/ti/am65-cpsw-nuss.c | 20 ++++++++++++-------- 1 file changed, 12 insertions(+), 8 deletions(-) diff --git a/drivers/net/ethernet/ti/am65-cpsw-nuss.c b/drivers/net/ethernet/ti/am65-cpsw-nuss.c index 3fec2cd45f5ef8..0520e9f4bea700 100644 --- a/drivers/net/ethernet/ti/am65-cpsw-nuss.c +++ b/drivers/net/ethernet/ti/am65-cpsw-nuss.c @@ -2744,10 +2744,9 @@ am65_cpsw_nuss_init_port_ndev(struct am65_cpsw_common *common, u32 port_idx) return 0; /* alloc netdev */ - port->ndev = devm_alloc_etherdev_mqs(common->dev, - sizeof(struct am65_cpsw_ndev_priv), - AM65_CPSW_MAX_QUEUES, - AM65_CPSW_MAX_QUEUES); + port->ndev = alloc_etherdev_mqs(sizeof(struct am65_cpsw_ndev_priv), + AM65_CPSW_MAX_QUEUES, + AM65_CPSW_MAX_QUEUES); if (!port->ndev) { dev_err(dev, "error allocating slave net_device %u\n", port->port_id); @@ -2868,8 +2867,12 @@ static void am65_cpsw_nuss_cleanup_ndev(struct am65_cpsw_common *common) for (i = 0; i < common->port_num; i++) { port = &common->ports[i]; - if (port->ndev && port->ndev->reg_state == NETREG_REGISTERED) + if (!port->ndev) + continue; + if (port->ndev->reg_state == NETREG_REGISTERED) unregister_netdev(port->ndev); + free_netdev(port->ndev); + port->ndev = NULL; } } @@ -3613,16 +3616,17 @@ static int am65_cpsw_nuss_probe(struct platform_device *pdev) ret = am65_cpsw_nuss_init_ndevs(common); if (ret) - goto err_free_phylink; + goto err_ndevs_clear; ret = am65_cpsw_nuss_register_ndevs(common); if (ret) - goto err_free_phylink; + goto err_ndevs_clear; pm_runtime_put(dev); return 0; -err_free_phylink: +err_ndevs_clear: + am65_cpsw_nuss_cleanup_ndev(common); am65_cpsw_nuss_phylink_cleanup(common); am65_cpts_release(common->cpts); err_of_clear: From 42fb3acf6826c6764ba79feb6e15229b43fd2f9f Mon Sep 17 00:00:00 2001 From: Jonas Gorski Date: Fri, 4 Oct 2024 10:47:17 +0200 Subject: [PATCH 40/87] net: dsa: b53: fix jumbo frame mtu check JMS_MIN_SIZE is the full ethernet frame length, while mtu is just the data payload size. Comparing these two meant that mtus between 1500 and 1518 did not trigger enabling jumbo frames. So instead compare the set mtu ETH_DATA_LEN, which is equal to JMS_MIN_SIZE - ETH_HLEN - ETH_FCS_LEN; Also do a check that the requested mtu is actually greater than the minimum length, else we do not need to enable jumbo frames. In practice this only introduced a very small range of mtus that did not work properly. Newer chips allow 2000 byte large frames by default, and older chips allow 1536 bytes long, which is equivalent to an mtu of 1514. So effectivly only mtus of 1515~1517 were broken. Fixes: 6ae5834b983a ("net: dsa: b53: add MTU configuration support") Signed-off-by: Jonas Gorski Reviewed-by: Florian Fainelli Signed-off-by: Paolo Abeni --- drivers/net/dsa/b53/b53_common.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/dsa/b53/b53_common.c b/drivers/net/dsa/b53/b53_common.c index 0783fc121bbbf9..57df00ad9dd4ce 100644 --- a/drivers/net/dsa/b53/b53_common.c +++ b/drivers/net/dsa/b53/b53_common.c @@ -2259,7 +2259,7 @@ static int b53_change_mtu(struct dsa_switch *ds, int port, int mtu) if (!dsa_is_cpu_port(ds, port)) return 0; - enable_jumbo = (mtu >= JMS_MIN_SIZE); + enable_jumbo = (mtu > ETH_DATA_LEN); allow_10_100 = (dev->chip_id == BCM583XX_DEVICE_ID); return b53_set_jumbo(dev, enable_jumbo, allow_10_100); From 680a8217dc00dc7e7da57888b3c053289b60eb2b Mon Sep 17 00:00:00 2001 From: Jonas Gorski Date: Fri, 4 Oct 2024 10:47:18 +0200 Subject: [PATCH 41/87] net: dsa: b53: fix max MTU for 1g switches JMS_MAX_SIZE is the ethernet frame length, not the MTU, which is payload without ethernet headers. According to the datasheets maximum supported frame length for most gigabyte swithes is 9720 bytes, so convert that to the expected MTU when using VLAN tagged frames. Fixes: 6ae5834b983a ("net: dsa: b53: add MTU configuration support") Signed-off-by: Jonas Gorski Reviewed-by: Florian Fainelli Signed-off-by: Paolo Abeni --- drivers/net/dsa/b53/b53_common.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/net/dsa/b53/b53_common.c b/drivers/net/dsa/b53/b53_common.c index 57df00ad9dd4ce..6fed3eb15ad9b2 100644 --- a/drivers/net/dsa/b53/b53_common.c +++ b/drivers/net/dsa/b53/b53_common.c @@ -27,6 +27,7 @@ #include #include #include +#include #include #include "b53_regs.h" @@ -224,6 +225,8 @@ static const struct b53_mib_desc b53_mibs_58xx[] = { #define B53_MIBS_58XX_SIZE ARRAY_SIZE(b53_mibs_58xx) +#define B53_MAX_MTU (9720 - ETH_HLEN - VLAN_HLEN - ETH_FCS_LEN) + static int b53_do_vlan_op(struct b53_device *dev, u8 op) { unsigned int i; @@ -2267,7 +2270,7 @@ static int b53_change_mtu(struct dsa_switch *ds, int port, int mtu) static int b53_get_max_mtu(struct dsa_switch *ds, int port) { - return JMS_MAX_SIZE; + return B53_MAX_MTU; } static const struct phylink_mac_ops b53_phylink_mac_ops = { From ca8c1f71c10193c270f772d70d34b15ad765d6a8 Mon Sep 17 00:00:00 2001 From: Jonas Gorski Date: Fri, 4 Oct 2024 10:47:19 +0200 Subject: [PATCH 42/87] net: dsa: b53: fix max MTU for BCM5325/BCM5365 BCM5325/BCM5365 do not support jumbo frames, so we should not report a jumbo frame mtu for them. But they do support so called "oversized" frames up to 1536 bytes long by default, so report an appropriate MTU. Fixes: 6ae5834b983a ("net: dsa: b53: add MTU configuration support") Signed-off-by: Jonas Gorski Reviewed-by: Florian Fainelli Signed-off-by: Paolo Abeni --- drivers/net/dsa/b53/b53_common.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/drivers/net/dsa/b53/b53_common.c b/drivers/net/dsa/b53/b53_common.c index 6fed3eb15ad9b2..e8b20bfa8b83ea 100644 --- a/drivers/net/dsa/b53/b53_common.c +++ b/drivers/net/dsa/b53/b53_common.c @@ -225,6 +225,7 @@ static const struct b53_mib_desc b53_mibs_58xx[] = { #define B53_MIBS_58XX_SIZE ARRAY_SIZE(b53_mibs_58xx) +#define B53_MAX_MTU_25 (1536 - ETH_HLEN - VLAN_HLEN - ETH_FCS_LEN) #define B53_MAX_MTU (9720 - ETH_HLEN - VLAN_HLEN - ETH_FCS_LEN) static int b53_do_vlan_op(struct b53_device *dev, u8 op) @@ -2270,6 +2271,11 @@ static int b53_change_mtu(struct dsa_switch *ds, int port, int mtu) static int b53_get_max_mtu(struct dsa_switch *ds, int port) { + struct b53_device *dev = ds->priv; + + if (is5325(dev) || is5365(dev)) + return B53_MAX_MTU_25; + return B53_MAX_MTU; } From e4b294f88a32438baf31762441f3dd1c996778be Mon Sep 17 00:00:00 2001 From: Jonas Gorski Date: Fri, 4 Oct 2024 10:47:20 +0200 Subject: [PATCH 43/87] net: dsa: b53: allow lower MTUs on BCM5325/5365 While BCM5325/5365 do not support jumbo frames, they do support slightly oversized frames, so do not error out if requesting a supported MTU for them. Fixes: 6ae5834b983a ("net: dsa: b53: add MTU configuration support") Signed-off-by: Jonas Gorski Reviewed-by: Florian Fainelli Signed-off-by: Paolo Abeni --- drivers/net/dsa/b53/b53_common.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/dsa/b53/b53_common.c b/drivers/net/dsa/b53/b53_common.c index e8b20bfa8b83ea..5b83f9b6cdac3d 100644 --- a/drivers/net/dsa/b53/b53_common.c +++ b/drivers/net/dsa/b53/b53_common.c @@ -2258,7 +2258,7 @@ static int b53_change_mtu(struct dsa_switch *ds, int port, int mtu) bool allow_10_100; if (is5325(dev) || is5365(dev)) - return -EOPNOTSUPP; + return 0; if (!dsa_is_cpu_port(ds, port)) return 0; From 2f3dcd0d39affe5b9ba1c351ce0e270c8bdd5109 Mon Sep 17 00:00:00 2001 From: Jonas Gorski Date: Fri, 4 Oct 2024 10:47:21 +0200 Subject: [PATCH 44/87] net: dsa: b53: fix jumbo frames on 10/100 ports All modern chips support and need the 10_100 bit set for supporting jumbo frames on 10/100 ports, so instead of enabling it only for 583XX enable it for everything except bcm63xx, where the bit is writeable, but does nothing. Tested on BCM53115, where jumbo frames were dropped at 10/100 speeds without the bit set. Fixes: 6ae5834b983a ("net: dsa: b53: add MTU configuration support") Signed-off-by: Jonas Gorski Reviewed-by: Florian Fainelli Signed-off-by: Paolo Abeni --- drivers/net/dsa/b53/b53_common.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/dsa/b53/b53_common.c b/drivers/net/dsa/b53/b53_common.c index 5b83f9b6cdac3d..c39cb119e760db 100644 --- a/drivers/net/dsa/b53/b53_common.c +++ b/drivers/net/dsa/b53/b53_common.c @@ -2264,7 +2264,7 @@ static int b53_change_mtu(struct dsa_switch *ds, int port, int mtu) return 0; enable_jumbo = (mtu > ETH_DATA_LEN); - allow_10_100 = (dev->chip_id == BCM583XX_DEVICE_ID); + allow_10_100 = !is63xx(dev); return b53_set_jumbo(dev, enable_jumbo, allow_10_100); } From 9c4beb2dfebab4e81f7aabde03ce2918e358e841 Mon Sep 17 00:00:00 2001 From: Javier Carrasco Date: Sat, 5 Oct 2024 07:29:40 +0200 Subject: [PATCH 45/87] selftests: net: add msg_oob to gitignore This executable is missing from the corresponding gitignore file. Add msg_oob to the net gitignore list. Signed-off-by: Javier Carrasco Link: https://patch.msgid.link/20241005-net-selftests-gitignore-v2-1-3a0b2876394a@gmail.com Signed-off-by: Jakub Kicinski --- tools/testing/selftests/net/.gitignore | 1 + 1 file changed, 1 insertion(+) diff --git a/tools/testing/selftests/net/.gitignore b/tools/testing/selftests/net/.gitignore index 1c04c780db6674..217d8b7a736502 100644 --- a/tools/testing/selftests/net/.gitignore +++ b/tools/testing/selftests/net/.gitignore @@ -16,6 +16,7 @@ ipsec ipv6_flowlabel ipv6_flowlabel_mgr log.txt +msg_oob msg_zerocopy ncdevmem nettest From 4227b50cff0586d6f92b20ce9672dbe881105ea7 Mon Sep 17 00:00:00 2001 From: Javier Carrasco Date: Sat, 5 Oct 2024 07:29:41 +0200 Subject: [PATCH 46/87] selftests: net: rds: add include.sh to EXTRA_CLEAN The include.sh file is generated when building the net/rds selftests, but there is no rule to delete it with the clean target. Add the file to EXTRA_CLEAN in order to remove it when required. Reviewed-by: Allison Henderson Signed-off-by: Javier Carrasco Link: https://patch.msgid.link/20241005-net-selftests-gitignore-v2-2-3a0b2876394a@gmail.com Signed-off-by: Jakub Kicinski --- tools/testing/selftests/net/rds/Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/testing/selftests/net/rds/Makefile b/tools/testing/selftests/net/rds/Makefile index cf30307a829b23..1803c39dbacb14 100644 --- a/tools/testing/selftests/net/rds/Makefile +++ b/tools/testing/selftests/net/rds/Makefile @@ -8,6 +8,6 @@ TEST_PROGS := run.sh \ TEST_FILES := include.sh -EXTRA_CLEAN := /tmp/rds_logs +EXTRA_CLEAN := /tmp/rds_logs include.sh include ../../lib.mk From 0e43a5a7b253ed3764929a43778d3c684092a277 Mon Sep 17 00:00:00 2001 From: Javier Carrasco Date: Sat, 5 Oct 2024 07:29:42 +0200 Subject: [PATCH 47/87] selftests: net: rds: add gitignore file for include.sh The generated include.sh should be ignored by git. Create a new gitignore and add the file to the list. Reviewed-by: Allison Henderson Signed-off-by: Javier Carrasco Link: https://patch.msgid.link/20241005-net-selftests-gitignore-v2-3-3a0b2876394a@gmail.com Signed-off-by: Jakub Kicinski --- tools/testing/selftests/net/rds/.gitignore | 1 + 1 file changed, 1 insertion(+) create mode 100644 tools/testing/selftests/net/rds/.gitignore diff --git a/tools/testing/selftests/net/rds/.gitignore b/tools/testing/selftests/net/rds/.gitignore new file mode 100644 index 00000000000000..1c6f04e2aa11c6 --- /dev/null +++ b/tools/testing/selftests/net/rds/.gitignore @@ -0,0 +1 @@ +include.sh From 1fd9e4f257827d939cc627541f12fc4bdd979eb1 Mon Sep 17 00:00:00 2001 From: Greg Thelen Date: Sat, 5 Oct 2024 14:56:00 -0700 Subject: [PATCH 48/87] selftests: make kselftest-clean remove libynl outputs Starting with 6.12 commit 85585b4bc8d8 ("selftests: add ncdevmem, netcat for devmem TCP") kselftest-all creates additional outputs that kselftest-clean does not cleanup: $ make defconfig $ make kselftest-all $ make kselftest-clean $ git clean -ndxf | grep tools/net Would remove tools/net/ynl/lib/__pycache__/ Would remove tools/net/ynl/lib/ynl.a Would remove tools/net/ynl/lib/ynl.d Would remove tools/net/ynl/lib/ynl.o Make kselftest-clean remove the newly added net/ynl outputs. Fixes: 85585b4bc8d8 ("selftests: add ncdevmem, netcat for devmem TCP") Signed-off-by: Greg Thelen Reviewed-by: Muhammad Usama Anjum Reviewed-by: Guenter Roeck Link: https://patch.msgid.link/20241005215600.852260-1-gthelen@google.com Signed-off-by: Jakub Kicinski --- tools/testing/selftests/net/ynl.mk | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/tools/testing/selftests/net/ynl.mk b/tools/testing/selftests/net/ynl.mk index 59cb26cf3f7384..1ef24119def0ac 100644 --- a/tools/testing/selftests/net/ynl.mk +++ b/tools/testing/selftests/net/ynl.mk @@ -19,3 +19,7 @@ $(YNL_OUTPUTS): CFLAGS += \ $(OUTPUT)/libynl.a: $(Q)$(MAKE) -C $(top_srcdir)/tools/net/ynl GENS="$(YNL_GENS)" libynl.a $(Q)cp $(top_srcdir)/tools/net/ynl/libynl.a $(OUTPUT)/libynl.a + +EXTRA_CLEAN += \ + $(top_srcdir)/tools/net/ynl/lib/__pycache__ \ + $(top_srcdir)/tools/net/ynl/lib/*.[ado] From b972060a47780aa2d46441e06b354156455cc877 Mon Sep 17 00:00:00 2001 From: Marcin Szycik Date: Tue, 24 Sep 2024 12:04:23 +0200 Subject: [PATCH 49/87] ice: Fix entering Safe Mode If DDP package is missing or corrupted, the driver should enter Safe Mode. Instead, an error is returned and probe fails. To fix this, don't exit init if ice_init_ddp_config() returns an error. Repro: * Remove or rename DDP package (/lib/firmware/intel/ice/ddp/ice.pkg) * Load ice Fixes: cc5776fe1832 ("ice: Enable switching default Tx scheduler topology") Reviewed-by: Przemek Kitszel Signed-off-by: Marcin Szycik Reviewed-by: Brett Creeley Tested-by: Pucha Himasekhar Reddy (A Contingent worker at Intel) Signed-off-by: Tony Nguyen --- drivers/net/ethernet/intel/ice/ice_main.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c index fbab72fab79ca7..da1352dc26affb 100644 --- a/drivers/net/ethernet/intel/ice/ice_main.c +++ b/drivers/net/ethernet/intel/ice/ice_main.c @@ -4767,14 +4767,12 @@ int ice_init_dev(struct ice_pf *pf) ice_init_feature_support(pf); err = ice_init_ddp_config(hw, pf); - if (err) - return err; /* if ice_init_ddp_config fails, ICE_FLAG_ADV_FEATURES bit won't be * set in pf->state, which will cause ice_is_safe_mode to return * true */ - if (ice_is_safe_mode(pf)) { + if (err || ice_is_safe_mode(pf)) { /* we already got function/device capabilities but these don't * reflect what the driver needs to do in safe mode. Instead of * adding conditional logic everywhere to ignore these From 8e60dbcbaaa177dacef55a61501790e201bf8c88 Mon Sep 17 00:00:00 2001 From: Marcin Szycik Date: Tue, 24 Sep 2024 12:04:24 +0200 Subject: [PATCH 50/87] ice: Fix netif_is_ice() in Safe Mode netif_is_ice() works by checking the pointer to netdev ops. However, it only checks for the default ice_netdev_ops, not ice_netdev_safe_mode_ops, so in Safe Mode it always returns false, which is unintuitive. While it doesn't look like netif_is_ice() is currently being called anywhere in Safe Mode, this could change and potentially lead to unexpected behaviour. Fixes: df006dd4b1dc ("ice: Add initial support framework for LAG") Reviewed-by: Przemek Kitszel Signed-off-by: Marcin Szycik Reviewed-by: Brett Creeley Tested-by: Sujai Buvaneswaran Signed-off-by: Tony Nguyen --- drivers/net/ethernet/intel/ice/ice_main.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c index da1352dc26affb..09d1a4eb571622 100644 --- a/drivers/net/ethernet/intel/ice/ice_main.c +++ b/drivers/net/ethernet/intel/ice/ice_main.c @@ -87,7 +87,8 @@ ice_indr_setup_tc_cb(struct net_device *netdev, struct Qdisc *sch, bool netif_is_ice(const struct net_device *dev) { - return dev && (dev->netdev_ops == &ice_netdev_ops); + return dev && (dev->netdev_ops == &ice_netdev_ops || + dev->netdev_ops == &ice_netdev_safe_mode_ops); } /** From fbcb968a98ac0b71f5a2bda2751d7a32d201f90d Mon Sep 17 00:00:00 2001 From: Wojciech Drewek Date: Fri, 27 Sep 2024 14:38:01 +0200 Subject: [PATCH 51/87] ice: Flush FDB entries before reset Triggering the reset while in switchdev mode causes errors[1]. Rules are already removed by this time because switch content is flushed in case of the reset. This means that rules were deleted from HW but SW still thinks they exist so when we get SWITCHDEV_FDB_DEL_TO_DEVICE notification we try to delete not existing rule. We can avoid these errors by clearing the rules early in the reset flow before they are removed from HW. Switchdev API will get notified that the rule was removed so we won't get SWITCHDEV_FDB_DEL_TO_DEVICE notification. Remove unnecessary ice_clear_sw_switch_recipes. [1] ice 0000:01:00.0: Failed to delete FDB forward rule, err: -2 ice 0000:01:00.0: Failed to delete FDB guard rule, err: -2 Fixes: 7c945a1a8e5f ("ice: Switchdev FDB events support") Reviewed-by: Mateusz Polchlopek Signed-off-by: Wojciech Drewek Tested-by: Sujai Buvaneswaran Signed-off-by: Tony Nguyen --- .../net/ethernet/intel/ice/ice_eswitch_br.c | 5 +++- .../net/ethernet/intel/ice/ice_eswitch_br.h | 1 + drivers/net/ethernet/intel/ice/ice_main.c | 24 +++---------------- 3 files changed, 8 insertions(+), 22 deletions(-) diff --git a/drivers/net/ethernet/intel/ice/ice_eswitch_br.c b/drivers/net/ethernet/intel/ice/ice_eswitch_br.c index f5aceb32bf4dd2..cccb7ddf61c975 100644 --- a/drivers/net/ethernet/intel/ice/ice_eswitch_br.c +++ b/drivers/net/ethernet/intel/ice/ice_eswitch_br.c @@ -582,10 +582,13 @@ ice_eswitch_br_switchdev_event(struct notifier_block *nb, return NOTIFY_DONE; } -static void ice_eswitch_br_fdb_flush(struct ice_esw_br *bridge) +void ice_eswitch_br_fdb_flush(struct ice_esw_br *bridge) { struct ice_esw_br_fdb_entry *entry, *tmp; + if (!bridge) + return; + list_for_each_entry_safe(entry, tmp, &bridge->fdb_list, list) ice_eswitch_br_fdb_entry_notify_and_cleanup(bridge, entry); } diff --git a/drivers/net/ethernet/intel/ice/ice_eswitch_br.h b/drivers/net/ethernet/intel/ice/ice_eswitch_br.h index c15c7344d7f853..66a2c804338f0d 100644 --- a/drivers/net/ethernet/intel/ice/ice_eswitch_br.h +++ b/drivers/net/ethernet/intel/ice/ice_eswitch_br.h @@ -117,5 +117,6 @@ void ice_eswitch_br_offloads_deinit(struct ice_pf *pf); int ice_eswitch_br_offloads_init(struct ice_pf *pf); +void ice_eswitch_br_fdb_flush(struct ice_esw_br *bridge); #endif /* _ICE_ESWITCH_BR_H_ */ diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c index 09d1a4eb571622..b1e7727b8677f9 100644 --- a/drivers/net/ethernet/intel/ice/ice_main.c +++ b/drivers/net/ethernet/intel/ice/ice_main.c @@ -521,25 +521,6 @@ static void ice_pf_dis_all_vsi(struct ice_pf *pf, bool locked) pf->vf_agg_node[node].num_vsis = 0; } -/** - * ice_clear_sw_switch_recipes - clear switch recipes - * @pf: board private structure - * - * Mark switch recipes as not created in sw structures. There are cases where - * rules (especially advanced rules) need to be restored, either re-read from - * hardware or added again. For example after the reset. 'recp_created' flag - * prevents from doing that and need to be cleared upfront. - */ -static void ice_clear_sw_switch_recipes(struct ice_pf *pf) -{ - struct ice_sw_recipe *recp; - u8 i; - - recp = pf->hw.switch_info->recp_list; - for (i = 0; i < ICE_MAX_NUM_RECIPES; i++) - recp[i].recp_created = false; -} - /** * ice_prepare_for_reset - prep for reset * @pf: board private structure @@ -576,8 +557,9 @@ ice_prepare_for_reset(struct ice_pf *pf, enum ice_reset_req reset_type) mutex_unlock(&pf->vfs.table_lock); if (ice_is_eswitch_mode_switchdev(pf)) { - if (reset_type != ICE_RESET_PFR) - ice_clear_sw_switch_recipes(pf); + rtnl_lock(); + ice_eswitch_br_fdb_flush(pf->eswitch.br_offloads->bridge); + rtnl_unlock(); } /* release ADQ specific HW and SW resources */ From bce9af1b030bf59d51bbabf909a3ef164787e44e Mon Sep 17 00:00:00 2001 From: Marcin Szycik Date: Fri, 27 Sep 2024 17:15:40 +0200 Subject: [PATCH 52/87] ice: Fix increasing MSI-X on VF Increasing MSI-X value on a VF leads to invalid memory operations. This is caused by not reallocating some arrays. Reproducer: modprobe ice echo 0 > /sys/bus/pci/devices/$PF_PCI/sriov_drivers_autoprobe echo 1 > /sys/bus/pci/devices/$PF_PCI/sriov_numvfs echo 17 > /sys/bus/pci/devices/$VF0_PCI/sriov_vf_msix_count Default MSI-X is 16, so 17 and above triggers this issue. KASAN reports: BUG: KASAN: slab-out-of-bounds in ice_vsi_alloc_ring_stats+0x38d/0x4b0 [ice] Read of size 8 at addr ffff8888b937d180 by task bash/28433 (...) Call Trace: (...) ? ice_vsi_alloc_ring_stats+0x38d/0x4b0 [ice] kasan_report+0xed/0x120 ? ice_vsi_alloc_ring_stats+0x38d/0x4b0 [ice] ice_vsi_alloc_ring_stats+0x38d/0x4b0 [ice] ice_vsi_cfg_def+0x3360/0x4770 [ice] ? mutex_unlock+0x83/0xd0 ? __pfx_ice_vsi_cfg_def+0x10/0x10 [ice] ? __pfx_ice_remove_vsi_lkup_fltr+0x10/0x10 [ice] ice_vsi_cfg+0x7f/0x3b0 [ice] ice_vf_reconfig_vsi+0x114/0x210 [ice] ice_sriov_set_msix_vec_count+0x3d0/0x960 [ice] sriov_vf_msix_count_store+0x21c/0x300 (...) Allocated by task 28201: (...) ice_vsi_cfg_def+0x1c8e/0x4770 [ice] ice_vsi_cfg+0x7f/0x3b0 [ice] ice_vsi_setup+0x179/0xa30 [ice] ice_sriov_configure+0xcaa/0x1520 [ice] sriov_numvfs_store+0x212/0x390 (...) To fix it, use ice_vsi_rebuild() instead of ice_vf_reconfig_vsi(). This causes the required arrays to be reallocated taking the new queue count into account (ice_vsi_realloc_stat_arrays()). Set req_txq and req_rxq before ice_vsi_rebuild(), so that realloc uses the newly set queue count. Additionally, ice_vsi_rebuild() does not remove VSI filters (ice_fltr_remove_all()), so ice_vf_init_host_cfg() is no longer necessary. Reported-by: Jacob Keller Fixes: 2a2cb4c6c181 ("ice: replace ice_vf_recreate_vsi() with ice_vf_reconfig_vsi()") Reviewed-by: Michal Swiatkowski Signed-off-by: Marcin Szycik Reviewed-by: Simon Horman Tested-by: Rafal Romanowski Signed-off-by: Tony Nguyen --- drivers/net/ethernet/intel/ice/ice_sriov.c | 11 ++++++++--- drivers/net/ethernet/intel/ice/ice_vf_lib.c | 2 +- drivers/net/ethernet/intel/ice/ice_vf_lib_private.h | 1 - 3 files changed, 9 insertions(+), 5 deletions(-) diff --git a/drivers/net/ethernet/intel/ice/ice_sriov.c b/drivers/net/ethernet/intel/ice/ice_sriov.c index c2d6b2a144e940..91cb393f616f2b 100644 --- a/drivers/net/ethernet/intel/ice/ice_sriov.c +++ b/drivers/net/ethernet/intel/ice/ice_sriov.c @@ -1121,7 +1121,10 @@ int ice_sriov_set_msix_vec_count(struct pci_dev *vf_dev, int msix_vec_count) if (vf->first_vector_idx < 0) goto unroll; - if (ice_vf_reconfig_vsi(vf) || ice_vf_init_host_cfg(vf, vsi)) { + vsi->req_txq = queues; + vsi->req_rxq = queues; + + if (ice_vsi_rebuild(vsi, ICE_VSI_FLAG_NO_INIT)) { /* Try to rebuild with previous values */ needs_rebuild = true; goto unroll; @@ -1150,8 +1153,10 @@ int ice_sriov_set_msix_vec_count(struct pci_dev *vf_dev, int msix_vec_count) } if (needs_rebuild) { - ice_vf_reconfig_vsi(vf); - ice_vf_init_host_cfg(vf, vsi); + vsi->req_txq = prev_queues; + vsi->req_rxq = prev_queues; + + ice_vsi_rebuild(vsi, ICE_VSI_FLAG_NO_INIT); } ice_ena_vf_mappings(vf); diff --git a/drivers/net/ethernet/intel/ice/ice_vf_lib.c b/drivers/net/ethernet/intel/ice/ice_vf_lib.c index 749a08ccf2678b..8c434689e3f78e 100644 --- a/drivers/net/ethernet/intel/ice/ice_vf_lib.c +++ b/drivers/net/ethernet/intel/ice/ice_vf_lib.c @@ -256,7 +256,7 @@ static void ice_vf_pre_vsi_rebuild(struct ice_vf *vf) * * It brings the VSI down and then reconfigures it with the hardware. */ -int ice_vf_reconfig_vsi(struct ice_vf *vf) +static int ice_vf_reconfig_vsi(struct ice_vf *vf) { struct ice_vsi *vsi = ice_get_vf_vsi(vf); struct ice_pf *pf = vf->pf; diff --git a/drivers/net/ethernet/intel/ice/ice_vf_lib_private.h b/drivers/net/ethernet/intel/ice/ice_vf_lib_private.h index 91ba7fe0eaee1a..0c7e77c0a09fa6 100644 --- a/drivers/net/ethernet/intel/ice/ice_vf_lib_private.h +++ b/drivers/net/ethernet/intel/ice/ice_vf_lib_private.h @@ -23,7 +23,6 @@ #warning "Only include ice_vf_lib_private.h in CONFIG_PCI_IOV virtualization files" #endif -int ice_vf_reconfig_vsi(struct ice_vf *vf); void ice_initialize_vf_entry(struct ice_vf *vf); void ice_dis_vf_qs(struct ice_vf *vf); int ice_check_vf_init(struct ice_vf *vf); From dac6c7b3d33756d6ce09f00a96ea2ecd79fae9fb Mon Sep 17 00:00:00 2001 From: Aleksandr Loktionov Date: Mon, 23 Sep 2024 11:12:19 +0200 Subject: [PATCH 53/87] i40e: Fix macvlan leak by synchronizing access to mac_filter_hash This patch addresses a macvlan leak issue in the i40e driver caused by concurrent access to vsi->mac_filter_hash. The leak occurs when multiple threads attempt to modify the mac_filter_hash simultaneously, leading to inconsistent state and potential memory leaks. To fix this, we now wrap the calls to i40e_del_mac_filter() and zeroing vf->default_lan_addr.addr with spin_lock/unlock_bh(&vsi->mac_filter_hash_lock), ensuring atomic operations and preventing concurrent access. Additionally, we add lockdep_assert_held(&vsi->mac_filter_hash_lock) in i40e_add_mac_filter() to help catch similar issues in the future. Reproduction steps: 1. Spawn VFs and configure port vlan on them. 2. Trigger concurrent macvlan operations (e.g., adding and deleting portvlan and/or mac filters). 3. Observe the potential memory leak and inconsistent state in the mac_filter_hash. This synchronization ensures the integrity of the mac_filter_hash and prevents the described leak. Fixes: fed0d9f13266 ("i40e: Fix VF's MAC Address change on VM") Reviewed-by: Arkadiusz Kubalewski Signed-off-by: Aleksandr Loktionov Reviewed-by: Simon Horman Tested-by: Rafal Romanowski Signed-off-by: Tony Nguyen --- drivers/net/ethernet/intel/i40e/i40e_main.c | 1 + drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c | 2 ++ 2 files changed, 3 insertions(+) diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c index 03205eb9f925f3..25295ae370b29c 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_main.c +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c @@ -1734,6 +1734,7 @@ struct i40e_mac_filter *i40e_add_mac_filter(struct i40e_vsi *vsi, struct hlist_node *h; int bkt; + lockdep_assert_held(&vsi->mac_filter_hash_lock); if (vsi->info.pvid) return i40e_add_filter(vsi, macaddr, le16_to_cpu(vsi->info.pvid)); diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c index 662622f01e3125..dfa785e39458db 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c +++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c @@ -2213,8 +2213,10 @@ static int i40e_vc_get_vf_resources_msg(struct i40e_vf *vf, u8 *msg) vfres->vsi_res[0].qset_handle = le16_to_cpu(vsi->info.qs_handle[0]); if (!(vf->driver_caps & VIRTCHNL_VF_OFFLOAD_USO) && !vf->pf_set_mac) { + spin_lock_bh(&vsi->mac_filter_hash_lock); i40e_del_mac_filter(vsi, vf->default_lan_addr.addr); eth_zero_addr(vf->default_lan_addr.addr); + spin_unlock_bh(&vsi->mac_filter_hash_lock); } ether_addr_copy(vfres->vsi_res[0].default_mac_addr, vf->default_lan_addr.addr); From 330a699ecbfc9c26ec92c6310686da1230b4e7eb Mon Sep 17 00:00:00 2001 From: Mohamed Khalfella Date: Tue, 24 Sep 2024 15:06:01 -0600 Subject: [PATCH 54/87] igb: Do not bring the device up after non-fatal error Commit 004d25060c78 ("igb: Fix igb_down hung on surprise removal") changed igb_io_error_detected() to ignore non-fatal pcie errors in order to avoid hung task that can happen when igb_down() is called multiple times. This caused an issue when processing transient non-fatal errors. igb_io_resume(), which is called after igb_io_error_detected(), assumes that device is brought down by igb_io_error_detected() if the interface is up. This resulted in panic with stacktrace below. [ T3256] igb 0000:09:00.0 haeth0: igb: haeth0 NIC Link is Down [ T292] pcieport 0000:00:1c.5: AER: Uncorrected (Non-Fatal) error received: 0000:09:00.0 [ T292] igb 0000:09:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID) [ T292] igb 0000:09:00.0: device [8086:1537] error status/mask=00004000/00000000 [ T292] igb 0000:09:00.0: [14] CmpltTO [ 200.105524,009][ T292] igb 0000:09:00.0: AER: TLP Header: 00000000 00000000 00000000 00000000 [ T292] pcieport 0000:00:1c.5: AER: broadcast error_detected message [ T292] igb 0000:09:00.0: Non-correctable non-fatal error reported. [ T292] pcieport 0000:00:1c.5: AER: broadcast mmio_enabled message [ T292] pcieport 0000:00:1c.5: AER: broadcast resume message [ T292] ------------[ cut here ]------------ [ T292] kernel BUG at net/core/dev.c:6539! [ T292] invalid opcode: 0000 [#1] PREEMPT SMP [ T292] RIP: 0010:napi_enable+0x37/0x40 [ T292] Call Trace: [ T292] [ T292] ? die+0x33/0x90 [ T292] ? do_trap+0xdc/0x110 [ T292] ? napi_enable+0x37/0x40 [ T292] ? do_error_trap+0x70/0xb0 [ T292] ? napi_enable+0x37/0x40 [ T292] ? napi_enable+0x37/0x40 [ T292] ? exc_invalid_op+0x4e/0x70 [ T292] ? napi_enable+0x37/0x40 [ T292] ? asm_exc_invalid_op+0x16/0x20 [ T292] ? napi_enable+0x37/0x40 [ T292] igb_up+0x41/0x150 [ T292] igb_io_resume+0x25/0x70 [ T292] report_resume+0x54/0x70 [ T292] ? report_frozen_detected+0x20/0x20 [ T292] pci_walk_bus+0x6c/0x90 [ T292] ? aer_print_port_info+0xa0/0xa0 [ T292] pcie_do_recovery+0x22f/0x380 [ T292] aer_process_err_devices+0x110/0x160 [ T292] aer_isr+0x1c1/0x1e0 [ T292] ? disable_irq_nosync+0x10/0x10 [ T292] irq_thread_fn+0x1a/0x60 [ T292] irq_thread+0xe3/0x1a0 [ T292] ? irq_set_affinity_notifier+0x120/0x120 [ T292] ? irq_affinity_notify+0x100/0x100 [ T292] kthread+0xe2/0x110 [ T292] ? kthread_complete_and_exit+0x20/0x20 [ T292] ret_from_fork+0x2d/0x50 [ T292] ? kthread_complete_and_exit+0x20/0x20 [ T292] ret_from_fork_asm+0x11/0x20 [ T292] To fix this issue igb_io_resume() checks if the interface is running and the device is not down this means igb_io_error_detected() did not bring the device down and there is no need to bring it up. Signed-off-by: Mohamed Khalfella Reviewed-by: Yuanyuan Zhong Fixes: 004d25060c78 ("igb: Fix igb_down hung on surprise removal") Reviewed-by: Simon Horman Tested-by: Pucha Himasekhar Reddy (A Contingent worker at Intel) Signed-off-by: Tony Nguyen --- drivers/net/ethernet/intel/igb/igb_main.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c index 1ef4cb871452a6..f1d0881687233e 100644 --- a/drivers/net/ethernet/intel/igb/igb_main.c +++ b/drivers/net/ethernet/intel/igb/igb_main.c @@ -9651,6 +9651,10 @@ static void igb_io_resume(struct pci_dev *pdev) struct igb_adapter *adapter = netdev_priv(netdev); if (netif_running(netdev)) { + if (!test_bit(__IGB_DOWN, &adapter->state)) { + dev_dbg(&pdev->dev, "Resuming from non-fatal error, do nothing.\n"); + return; + } if (igb_up(adapter)) { dev_err(&pdev->dev, "igb_up failed after reset\n"); return; From 9d9e5347b035412daa844f884b94a05bac94f864 Mon Sep 17 00:00:00 2001 From: Vitaly Lifshits Date: Sun, 8 Sep 2024 09:49:17 +0300 Subject: [PATCH 55/87] e1000e: change I219 (19) devices to ADP Sporadic issues, such as PHY access loss, have been observed on I219 (19) devices. It was found that these devices have hardware more closely related to ADP than MTP and the issues were caused by taking MTP-specific flows. Change the MAC and board types of these devices from MTP to ADP to correctly reflect the LAN hardware, and flows, of these devices. Fixes: db2d737d63c5 ("e1000e: Separate MTP board type from ADP") Signed-off-by: Vitaly Lifshits Tested-by: Mor Bar-Gabay Signed-off-by: Tony Nguyen --- drivers/net/ethernet/intel/e1000e/hw.h | 4 ++-- drivers/net/ethernet/intel/e1000e/netdev.c | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/net/ethernet/intel/e1000e/hw.h b/drivers/net/ethernet/intel/e1000e/hw.h index 4b6e7536170abc..fc8ed38aa09554 100644 --- a/drivers/net/ethernet/intel/e1000e/hw.h +++ b/drivers/net/ethernet/intel/e1000e/hw.h @@ -108,8 +108,8 @@ struct e1000_hw; #define E1000_DEV_ID_PCH_RPL_I219_V22 0x0DC8 #define E1000_DEV_ID_PCH_MTP_I219_LM18 0x550A #define E1000_DEV_ID_PCH_MTP_I219_V18 0x550B -#define E1000_DEV_ID_PCH_MTP_I219_LM19 0x550C -#define E1000_DEV_ID_PCH_MTP_I219_V19 0x550D +#define E1000_DEV_ID_PCH_ADP_I219_LM19 0x550C +#define E1000_DEV_ID_PCH_ADP_I219_V19 0x550D #define E1000_DEV_ID_PCH_LNP_I219_LM20 0x550E #define E1000_DEV_ID_PCH_LNP_I219_V20 0x550F #define E1000_DEV_ID_PCH_LNP_I219_LM21 0x5510 diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c index f103249b12facf..07e90334635824 100644 --- a/drivers/net/ethernet/intel/e1000e/netdev.c +++ b/drivers/net/ethernet/intel/e1000e/netdev.c @@ -7899,10 +7899,10 @@ static const struct pci_device_id e1000_pci_tbl[] = { { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_ADP_I219_V17), board_pch_adp }, { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_RPL_I219_LM22), board_pch_adp }, { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_RPL_I219_V22), board_pch_adp }, + { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_ADP_I219_LM19), board_pch_adp }, + { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_ADP_I219_V19), board_pch_adp }, { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_MTP_I219_LM18), board_pch_mtp }, { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_MTP_I219_V18), board_pch_mtp }, - { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_MTP_I219_LM19), board_pch_mtp }, - { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_MTP_I219_V19), board_pch_mtp }, { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_LNP_I219_LM20), board_pch_mtp }, { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_LNP_I219_V20), board_pch_mtp }, { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_LNP_I219_LM21), board_pch_mtp }, From 3cb7cf1540ddff5473d6baeb530228d19bc97b8a Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Mon, 7 Oct 2024 18:41:30 +0000 Subject: [PATCH 56/87] net/sched: accept TCA_STAB only for root qdisc Most qdiscs maintain their backlog using qdisc_pkt_len(skb) on the assumption it is invariant between the enqueue() and dequeue() handlers. Unfortunately syzbot can crash a host rather easily using a TBF + SFQ combination, with an STAB on SFQ [1] We can't support TCA_STAB on arbitrary level, this would require to maintain per-qdisc storage. [1] [ 88.796496] BUG: kernel NULL pointer dereference, address: 0000000000000000 [ 88.798611] #PF: supervisor read access in kernel mode [ 88.799014] #PF: error_code(0x0000) - not-present page [ 88.799506] PGD 0 P4D 0 [ 88.799829] Oops: Oops: 0000 [#1] SMP NOPTI [ 88.800569] CPU: 14 UID: 0 PID: 2053 Comm: b371744477 Not tainted 6.12.0-rc1-virtme #1117 [ 88.801107] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 [ 88.801779] RIP: 0010:sfq_dequeue (net/sched/sch_sfq.c:272 net/sched/sch_sfq.c:499) sch_sfq [ 88.802544] Code: 0f b7 50 12 48 8d 04 d5 00 00 00 00 48 89 d6 48 29 d0 48 8b 91 c0 01 00 00 48 c1 e0 03 48 01 c2 66 83 7a 1a 00 7e c0 48 8b 3a <4c> 8b 07 4c 89 02 49 89 50 08 48 c7 47 08 00 00 00 00 48 c7 07 00 All code ======== 0: 0f b7 50 12 movzwl 0x12(%rax),%edx 4: 48 8d 04 d5 00 00 00 lea 0x0(,%rdx,8),%rax b: 00 c: 48 89 d6 mov %rdx,%rsi f: 48 29 d0 sub %rdx,%rax 12: 48 8b 91 c0 01 00 00 mov 0x1c0(%rcx),%rdx 19: 48 c1 e0 03 shl $0x3,%rax 1d: 48 01 c2 add %rax,%rdx 20: 66 83 7a 1a 00 cmpw $0x0,0x1a(%rdx) 25: 7e c0 jle 0xffffffffffffffe7 27: 48 8b 3a mov (%rdx),%rdi 2a:* 4c 8b 07 mov (%rdi),%r8 <-- trapping instruction 2d: 4c 89 02 mov %r8,(%rdx) 30: 49 89 50 08 mov %rdx,0x8(%r8) 34: 48 c7 47 08 00 00 00 movq $0x0,0x8(%rdi) 3b: 00 3c: 48 rex.W 3d: c7 .byte 0xc7 3e: 07 (bad) ... Code starting with the faulting instruction =========================================== 0: 4c 8b 07 mov (%rdi),%r8 3: 4c 89 02 mov %r8,(%rdx) 6: 49 89 50 08 mov %rdx,0x8(%r8) a: 48 c7 47 08 00 00 00 movq $0x0,0x8(%rdi) 11: 00 12: 48 rex.W 13: c7 .byte 0xc7 14: 07 (bad) ... [ 88.803721] RSP: 0018:ffff9a1f892b7d58 EFLAGS: 00000206 [ 88.804032] RAX: 0000000000000000 RBX: ffff9a1f8420c800 RCX: ffff9a1f8420c800 [ 88.804560] RDX: ffff9a1f81bc1440 RSI: 0000000000000000 RDI: 0000000000000000 [ 88.805056] RBP: ffffffffc04bb0e0 R08: 0000000000000001 R09: 00000000ff7f9a1f [ 88.805473] R10: 000000000001001b R11: 0000000000009a1f R12: 0000000000000140 [ 88.806194] R13: 0000000000000001 R14: ffff9a1f886df400 R15: ffff9a1f886df4ac [ 88.806734] FS: 00007f445601a740(0000) GS:ffff9a2e7fd80000(0000) knlGS:0000000000000000 [ 88.807225] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 88.807672] CR2: 0000000000000000 CR3: 000000050cc46000 CR4: 00000000000006f0 [ 88.808165] Call Trace: [ 88.808459] [ 88.808710] ? __die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434) [ 88.809261] ? page_fault_oops (arch/x86/mm/fault.c:715) [ 88.809561] ? exc_page_fault (./arch/x86/include/asm/irqflags.h:26 ./arch/x86/include/asm/irqflags.h:87 ./arch/x86/include/asm/irqflags.h:147 arch/x86/mm/fault.c:1489 arch/x86/mm/fault.c:1539) [ 88.809806] ? asm_exc_page_fault (./arch/x86/include/asm/idtentry.h:623) [ 88.810074] ? sfq_dequeue (net/sched/sch_sfq.c:272 net/sched/sch_sfq.c:499) sch_sfq [ 88.810411] sfq_reset (net/sched/sch_sfq.c:525) sch_sfq [ 88.810671] qdisc_reset (./include/linux/skbuff.h:2135 ./include/linux/skbuff.h:2441 ./include/linux/skbuff.h:3304 ./include/linux/skbuff.h:3310 net/sched/sch_generic.c:1036) [ 88.810950] tbf_reset (./include/linux/timekeeping.h:169 net/sched/sch_tbf.c:334) sch_tbf [ 88.811208] qdisc_reset (./include/linux/skbuff.h:2135 ./include/linux/skbuff.h:2441 ./include/linux/skbuff.h:3304 ./include/linux/skbuff.h:3310 net/sched/sch_generic.c:1036) [ 88.811484] netif_set_real_num_tx_queues (./include/linux/spinlock.h:396 ./include/net/sch_generic.h:768 net/core/dev.c:2958) [ 88.811870] __tun_detach (drivers/net/tun.c:590 drivers/net/tun.c:673) [ 88.812271] tun_chr_close (drivers/net/tun.c:702 drivers/net/tun.c:3517) [ 88.812505] __fput (fs/file_table.c:432 (discriminator 1)) [ 88.812735] task_work_run (kernel/task_work.c:230) [ 88.813016] do_exit (kernel/exit.c:940) [ 88.813372] ? trace_hardirqs_on (kernel/trace/trace_preemptirq.c:58 (discriminator 4)) [ 88.813639] ? handle_mm_fault (./arch/x86/include/asm/irqflags.h:42 ./arch/x86/include/asm/irqflags.h:97 ./arch/x86/include/asm/irqflags.h:155 ./include/linux/memcontrol.h:1022 ./include/linux/memcontrol.h:1045 ./include/linux/memcontrol.h:1052 mm/memory.c:5928 mm/memory.c:6088) [ 88.813867] do_group_exit (kernel/exit.c:1070) [ 88.814138] __x64_sys_exit_group (kernel/exit.c:1099) [ 88.814490] x64_sys_call (??:?) [ 88.814791] do_syscall_64 (arch/x86/entry/common.c:52 (discriminator 1) arch/x86/entry/common.c:83 (discriminator 1)) [ 88.815012] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) [ 88.815495] RIP: 0033:0x7f44560f1975 Fixes: 175f9c1bba9b ("net_sched: Add size table for qdiscs") Reported-by: syzbot Signed-off-by: Eric Dumazet Cc: Daniel Borkmann Link: https://patch.msgid.link/20241007184130.3960565-1-edumazet@google.com Signed-off-by: Jakub Kicinski --- include/net/sch_generic.h | 1 - net/sched/sch_api.c | 7 ++++++- 2 files changed, 6 insertions(+), 2 deletions(-) diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h index 79edd5b5e3c913..5d74fa7e694cc8 100644 --- a/include/net/sch_generic.h +++ b/include/net/sch_generic.h @@ -848,7 +848,6 @@ static inline void qdisc_calculate_pkt_len(struct sk_buff *skb, static inline int qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch, struct sk_buff **to_free) { - qdisc_calculate_pkt_len(skb, sch); return sch->enqueue(skb, sch, to_free); } diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c index 74afc210527d23..2eefa478387997 100644 --- a/net/sched/sch_api.c +++ b/net/sched/sch_api.c @@ -593,7 +593,6 @@ void __qdisc_calculate_pkt_len(struct sk_buff *skb, pkt_len = 1; qdisc_skb_cb(skb)->pkt_len = pkt_len; } -EXPORT_SYMBOL(__qdisc_calculate_pkt_len); void qdisc_warn_nonwc(const char *txt, struct Qdisc *qdisc) { @@ -1201,6 +1200,12 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent, return -EINVAL; } + if (new && + !(parent->flags & TCQ_F_MQROOT) && + rcu_access_pointer(new->stab)) { + NL_SET_ERR_MSG(extack, "STAB not supported on a non root"); + return -EINVAL; + } err = cops->graft(parent, cl, new, &old, extack); if (err) return err; From 08c8acc9d8f3f70d62dd928571368d5018206490 Mon Sep 17 00:00:00 2001 From: Rosen Penev Date: Mon, 7 Oct 2024 16:57:11 -0700 Subject: [PATCH 57/87] net: ibm: emac: mal: fix wrong goto dcr_map is called in the previous if and therefore needs to be unmapped. Fixes: 1ff0fcfcb1a6 ("ibm_newemac: Fix new MAL feature handling") Signed-off-by: Rosen Penev Link: https://patch.msgid.link/20241007235711.5714-1-rosenp@gmail.com Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/ibm/emac/mal.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/ibm/emac/mal.c b/drivers/net/ethernet/ibm/emac/mal.c index d92dd9c83031ee..0c5e22d14372ac 100644 --- a/drivers/net/ethernet/ibm/emac/mal.c +++ b/drivers/net/ethernet/ibm/emac/mal.c @@ -578,7 +578,7 @@ static int mal_probe(struct platform_device *ofdev) printk(KERN_ERR "%pOF: Support for 405EZ not enabled!\n", ofdev->dev.of_node); err = -ENODEV; - goto fail; + goto fail_unmap; #endif } From ff8ee11e778520c5716b7f165d2c7ce14d6a068b Mon Sep 17 00:00:00 2001 From: MD Danish Anwar Date: Mon, 7 Oct 2024 11:11:24 +0530 Subject: [PATCH 58/87] net: ti: icssg-prueth: Fix race condition for VLAN table access The VLAN table is a shared memory between the two ports/slices in a ICSSG cluster and this may lead to race condition when the common code paths for both ports are executed in different CPUs. Fix the race condition access by locking the shared memory access Fixes: 487f7323f39a ("net: ti: icssg-prueth: Add helper functions to configure FDB") Signed-off-by: MD Danish Anwar Reviewed-by: Roger Quadros Signed-off-by: David S. Miller --- drivers/net/ethernet/ti/icssg/icssg_config.c | 2 ++ drivers/net/ethernet/ti/icssg/icssg_prueth.c | 1 + drivers/net/ethernet/ti/icssg/icssg_prueth.h | 2 ++ 3 files changed, 5 insertions(+) diff --git a/drivers/net/ethernet/ti/icssg/icssg_config.c b/drivers/net/ethernet/ti/icssg/icssg_config.c index 72ace151d8e9c9..5d2491c2943a8b 100644 --- a/drivers/net/ethernet/ti/icssg/icssg_config.c +++ b/drivers/net/ethernet/ti/icssg/icssg_config.c @@ -735,6 +735,7 @@ void icssg_vtbl_modify(struct prueth_emac *emac, u8 vid, u8 port_mask, u8 fid_c1; tbl = prueth->vlan_tbl; + spin_lock(&prueth->vtbl_lock); fid_c1 = tbl[vid].fid_c1; /* FID_C1: bit0..2 port membership mask, @@ -750,6 +751,7 @@ void icssg_vtbl_modify(struct prueth_emac *emac, u8 vid, u8 port_mask, } tbl[vid].fid_c1 = fid_c1; + spin_unlock(&prueth->vtbl_lock); } EXPORT_SYMBOL_GPL(icssg_vtbl_modify); diff --git a/drivers/net/ethernet/ti/icssg/icssg_prueth.c b/drivers/net/ethernet/ti/icssg/icssg_prueth.c index 5fd9902ab181e9..5c20ceb164dff2 100644 --- a/drivers/net/ethernet/ti/icssg/icssg_prueth.c +++ b/drivers/net/ethernet/ti/icssg/icssg_prueth.c @@ -1442,6 +1442,7 @@ static int prueth_probe(struct platform_device *pdev) icss_iep_init_fw(prueth->iep1); } + spin_lock_init(&prueth->vtbl_lock); /* setup netdev interfaces */ if (eth0_node) { ret = prueth_netdev_init(prueth, eth0_node); diff --git a/drivers/net/ethernet/ti/icssg/icssg_prueth.h b/drivers/net/ethernet/ti/icssg/icssg_prueth.h index bba6da2e6bd8f9..8722bb4a268a15 100644 --- a/drivers/net/ethernet/ti/icssg/icssg_prueth.h +++ b/drivers/net/ethernet/ti/icssg/icssg_prueth.h @@ -296,6 +296,8 @@ struct prueth { bool is_switchmode_supported; unsigned char switch_id[MAX_PHYS_ITEM_ID_LEN]; int default_vlan; + /** @vtbl_lock: Lock for vtbl in shared memory */ + spinlock_t vtbl_lock; }; struct emac_tx_ts_response { From a6ad589c1d118f9d5b1bc4c6888d42919f830340 Mon Sep 17 00:00:00 2001 From: Heiner Kallweit Date: Mon, 7 Oct 2024 11:57:41 +0200 Subject: [PATCH 59/87] net: phy: realtek: Fix MMD access on RTL8126A-integrated PHY All MMD reads return 0 for the RTL8126A-integrated PHY. Therefore phylib assumes it doesn't support EEE, what results in higher power consumption, and a significantly higher chip temperature in my case. To fix this split out the PHY driver for the RTL8126A-integrated PHY and set the read_mmd/write_mmd callbacks to read from vendor-specific registers. Fixes: 5befa3728b85 ("net: phy: realtek: add support for RTL8126A-integrated 5Gbps PHY") Cc: stable@vger.kernel.org Signed-off-by: Heiner Kallweit Signed-off-by: David S. Miller --- drivers/net/phy/realtek.c | 24 +++++++++++++++++++++++- 1 file changed, 23 insertions(+), 1 deletion(-) diff --git a/drivers/net/phy/realtek.c b/drivers/net/phy/realtek.c index c15d2f66ef0dc2..166f6a7283731e 100644 --- a/drivers/net/phy/realtek.c +++ b/drivers/net/phy/realtek.c @@ -1081,6 +1081,16 @@ static int rtl8221b_vn_cg_c45_match_phy_device(struct phy_device *phydev) return rtlgen_is_c45_match(phydev, RTL_8221B_VN_CG, true); } +static int rtl8251b_c22_match_phy_device(struct phy_device *phydev) +{ + return rtlgen_is_c45_match(phydev, RTL_8251B, false); +} + +static int rtl8251b_c45_match_phy_device(struct phy_device *phydev) +{ + return rtlgen_is_c45_match(phydev, RTL_8251B, true); +} + static int rtlgen_resume(struct phy_device *phydev) { int ret = genphy_resume(phydev); @@ -1418,7 +1428,7 @@ static struct phy_driver realtek_drvs[] = { .suspend = genphy_c45_pma_suspend, .resume = rtlgen_c45_resume, }, { - PHY_ID_MATCH_EXACT(0x001cc862), + .match_phy_device = rtl8251b_c45_match_phy_device, .name = "RTL8251B 5Gbps PHY", .get_features = rtl822x_get_features, .config_aneg = rtl822x_config_aneg, @@ -1427,6 +1437,18 @@ static struct phy_driver realtek_drvs[] = { .resume = rtlgen_resume, .read_page = rtl821x_read_page, .write_page = rtl821x_write_page, + }, { + .match_phy_device = rtl8251b_c22_match_phy_device, + .name = "RTL8126A-internal 5Gbps PHY", + .get_features = rtl822x_get_features, + .config_aneg = rtl822x_config_aneg, + .read_status = rtl822x_read_status, + .suspend = genphy_suspend, + .resume = rtlgen_resume, + .read_page = rtl821x_read_page, + .write_page = rtl821x_write_page, + .read_mmd = rtl822x_read_mmd, + .write_mmd = rtl822x_write_mmd, }, { PHY_ID_MATCH_EXACT(0x001ccad0), .name = "RTL8224 2.5Gbps PHY", From 82c5b53140faf89c31ea2b3a0985a2f291694169 Mon Sep 17 00:00:00 2001 From: Daniel Palmer Date: Mon, 7 Oct 2024 19:43:17 +0900 Subject: [PATCH 60/87] net: amd: mvme147: Fix probe banner message Currently this driver prints this line with what looks like a rogue format specifier when the device is probed: [ 2.840000] eth%d: MVME147 at 0xfffe1800, irq 12, Hardware Address xx:xx:xx:xx:xx:xx Change the printk() for netdev_info() and move it after the registration has completed so it prints out the name of the interface properly. Signed-off-by: Daniel Palmer Reviewed-by: Simon Horman Signed-off-by: David S. Miller --- drivers/net/ethernet/amd/mvme147.c | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/drivers/net/ethernet/amd/mvme147.c b/drivers/net/ethernet/amd/mvme147.c index c156566c09064c..f19b04b92fa9f3 100644 --- a/drivers/net/ethernet/amd/mvme147.c +++ b/drivers/net/ethernet/amd/mvme147.c @@ -105,10 +105,6 @@ static struct net_device * __init mvme147lance_probe(void) macaddr[3] = address&0xff; eth_hw_addr_set(dev, macaddr); - printk("%s: MVME147 at 0x%08lx, irq %d, Hardware Address %pM\n", - dev->name, dev->base_addr, MVME147_LANCE_IRQ, - dev->dev_addr); - lp = netdev_priv(dev); lp->ram = __get_dma_pages(GFP_ATOMIC, 3); /* 32K */ if (!lp->ram) { @@ -138,6 +134,9 @@ static struct net_device * __init mvme147lance_probe(void) return ERR_PTR(err); } + netdev_info(dev, "MVME147 at 0x%08lx, irq %d, Hardware Address %pM\n", + dev->base_addr, MVME147_LANCE_IRQ, dev->dev_addr); + return dev; } From 4d5c70e6155d5eae198bade4afeab3c1b15073b6 Mon Sep 17 00:00:00 2001 From: Xin Long Date: Mon, 7 Oct 2024 12:25:11 -0400 Subject: [PATCH 61/87] sctp: ensure sk_state is set to CLOSED if hashing fails in sctp_listen_start If hashing fails in sctp_listen_start(), the socket remains in the LISTENING state, even though it was not added to the hash table. This can lead to a scenario where a socket appears to be listening without actually being accessible. This patch ensures that if the hashing operation fails, the sk_state is set back to CLOSED before returning an error. Note that there is no need to undo the autobind operation if hashing fails, as the bind port can still be used for next listen() call on the same socket. Fixes: 76c6d988aeb3 ("sctp: add sock_reuseport for the sock in __sctp_hash_endpoint") Reported-by: Marcelo Ricardo Leitner Signed-off-by: Xin Long Acked-by: Marcelo Ricardo Leitner Signed-off-by: David S. Miller --- net/sctp/socket.c | 18 +++++++++++++----- 1 file changed, 13 insertions(+), 5 deletions(-) diff --git a/net/sctp/socket.c b/net/sctp/socket.c index 078bcb3858c79c..36ee34f483d703 100644 --- a/net/sctp/socket.c +++ b/net/sctp/socket.c @@ -8531,6 +8531,7 @@ static int sctp_listen_start(struct sock *sk, int backlog) struct sctp_endpoint *ep = sp->ep; struct crypto_shash *tfm = NULL; char alg[32]; + int err; /* Allocate HMAC for generating cookie. */ if (!sp->hmac && sp->sctp_hmac_alg) { @@ -8558,18 +8559,25 @@ static int sctp_listen_start(struct sock *sk, int backlog) inet_sk_set_state(sk, SCTP_SS_LISTENING); if (!ep->base.bind_addr.port) { if (sctp_autobind(sk)) { - inet_sk_set_state(sk, SCTP_SS_CLOSED); - return -EAGAIN; + err = -EAGAIN; + goto err; } } else { if (sctp_get_port(sk, inet_sk(sk)->inet_num)) { - inet_sk_set_state(sk, SCTP_SS_CLOSED); - return -EADDRINUSE; + err = -EADDRINUSE; + goto err; } } WRITE_ONCE(sk->sk_max_ack_backlog, backlog); - return sctp_hash_endpoint(ep); + err = sctp_hash_endpoint(ep); + if (err) + goto err; + + return 0; +err: + inet_sk_set_state(sk, SCTP_SS_CLOSED); + return err; } /* From 983e35ce2e1ee4037f6f5d5398dfc107b22ad569 Mon Sep 17 00:00:00 2001 From: Jijie Shao Date: Tue, 8 Oct 2024 10:48:36 +0800 Subject: [PATCH 62/87] net: hns3/hns: Update the maintainer for the HNS3/HNS ethernet driver Yisen Zhuang has left the company in September. Jian Shen will be responsible for maintaining the hns3/hns driver's code in the future, so add Jian Shen to the hns3/hns driver's matainer list. Signed-off-by: Jijie Shao Reviewed-by: Simon Horman Signed-off-by: David S. Miller --- MAINTAINERS | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/MAINTAINERS b/MAINTAINERS index c27f3190737f8b..58380aeafbf0da 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -10267,7 +10267,7 @@ F: Documentation/devicetree/bindings/arm/hisilicon/low-pin-count.yaml F: drivers/bus/hisi_lpc.c HISILICON NETWORK SUBSYSTEM 3 DRIVER (HNS3) -M: Yisen Zhuang +M: Jian Shen M: Salil Mehta M: Jijie Shao L: netdev@vger.kernel.org @@ -10276,7 +10276,7 @@ W: http://www.hisilicon.com F: drivers/net/ethernet/hisilicon/hns3/ HISILICON NETWORK SUBSYSTEM DRIVER -M: Yisen Zhuang +M: Jian Shen M: Salil Mehta L: netdev@vger.kernel.org S: Maintained From 0bfcb7b71e735560077a42847f69597ec7dcc326 Mon Sep 17 00:00:00 2001 From: Florian Westphal Date: Mon, 7 Oct 2024 11:28:16 +0200 Subject: [PATCH 63/87] netfilter: xtables: avoid NFPROTO_UNSPEC where needed syzbot managed to call xt_cluster match via ebtables: WARNING: CPU: 0 PID: 11 at net/netfilter/xt_cluster.c:72 xt_cluster_mt+0x196/0x780 [..] ebt_do_table+0x174b/0x2a40 Module registers to NFPROTO_UNSPEC, but it assumes ipv4/ipv6 packet processing. As this is only useful to restrict locally terminating TCP/UDP traffic, register this for ipv4 and ipv6 family only. Pablo points out that this is a general issue, direct users of the set/getsockopt interface can call into targets/matches that were only intended for use with ip(6)tables. Check all UNSPEC matches and targets for similar issues: - matches and targets are fine except if they assume skb_network_header() is valid -- this is only true when called from inet layer: ip(6) stack pulls the ip/ipv6 header into linear data area. - targets that return XT_CONTINUE or other xtables verdicts must be restricted too, they are incompatbile with the ebtables traverser, e.g. EBT_CONTINUE is a completely different value than XT_CONTINUE. Most matches/targets are changed to register for NFPROTO_IPV4/IPV6, as they are provided for use by ip(6)tables. The MARK target is also used by arptables, so register for NFPROTO_ARP too. While at it, bail out if connbytes fails to enable the corresponding conntrack family. This change passes the selftests in iptables.git. Reported-by: syzbot+256c348558aa5cf611a9@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netfilter-devel/66fec2e2.050a0220.9ec68.0047.GAE@google.com/ Fixes: 0269ea493734 ("netfilter: xtables: add cluster match") Signed-off-by: Florian Westphal Co-developed-by: Pablo Neira Ayuso Signed-off-by: Pablo Neira Ayuso --- net/netfilter/xt_CHECKSUM.c | 33 ++++++---- net/netfilter/xt_CLASSIFY.c | 16 ++++- net/netfilter/xt_CONNSECMARK.c | 36 +++++++---- net/netfilter/xt_CT.c | 106 +++++++++++++++++++++------------ net/netfilter/xt_IDLETIMER.c | 59 ++++++++++++------ net/netfilter/xt_LED.c | 39 ++++++++---- net/netfilter/xt_NFLOG.c | 36 +++++++---- net/netfilter/xt_RATEEST.c | 39 ++++++++---- net/netfilter/xt_SECMARK.c | 27 ++++++++- net/netfilter/xt_TRACE.c | 35 +++++++---- net/netfilter/xt_addrtype.c | 15 ++++- net/netfilter/xt_cluster.c | 33 ++++++---- net/netfilter/xt_connbytes.c | 4 +- net/netfilter/xt_connlimit.c | 39 ++++++++---- net/netfilter/xt_connmark.c | 28 ++++++++- net/netfilter/xt_mark.c | 42 +++++++++---- 16 files changed, 422 insertions(+), 165 deletions(-) diff --git a/net/netfilter/xt_CHECKSUM.c b/net/netfilter/xt_CHECKSUM.c index c8a639f5616841..9d99f5a3d1764b 100644 --- a/net/netfilter/xt_CHECKSUM.c +++ b/net/netfilter/xt_CHECKSUM.c @@ -63,24 +63,37 @@ static int checksum_tg_check(const struct xt_tgchk_param *par) return 0; } -static struct xt_target checksum_tg_reg __read_mostly = { - .name = "CHECKSUM", - .family = NFPROTO_UNSPEC, - .target = checksum_tg, - .targetsize = sizeof(struct xt_CHECKSUM_info), - .table = "mangle", - .checkentry = checksum_tg_check, - .me = THIS_MODULE, +static struct xt_target checksum_tg_reg[] __read_mostly = { + { + .name = "CHECKSUM", + .family = NFPROTO_IPV4, + .target = checksum_tg, + .targetsize = sizeof(struct xt_CHECKSUM_info), + .table = "mangle", + .checkentry = checksum_tg_check, + .me = THIS_MODULE, + }, +#if IS_ENABLED(CONFIG_IP6_NF_IPTABLES) + { + .name = "CHECKSUM", + .family = NFPROTO_IPV6, + .target = checksum_tg, + .targetsize = sizeof(struct xt_CHECKSUM_info), + .table = "mangle", + .checkentry = checksum_tg_check, + .me = THIS_MODULE, + }, +#endif }; static int __init checksum_tg_init(void) { - return xt_register_target(&checksum_tg_reg); + return xt_register_targets(checksum_tg_reg, ARRAY_SIZE(checksum_tg_reg)); } static void __exit checksum_tg_exit(void) { - xt_unregister_target(&checksum_tg_reg); + xt_unregister_targets(checksum_tg_reg, ARRAY_SIZE(checksum_tg_reg)); } module_init(checksum_tg_init); diff --git a/net/netfilter/xt_CLASSIFY.c b/net/netfilter/xt_CLASSIFY.c index 0accac98dea784..0ae8d8a1216e19 100644 --- a/net/netfilter/xt_CLASSIFY.c +++ b/net/netfilter/xt_CLASSIFY.c @@ -38,9 +38,9 @@ static struct xt_target classify_tg_reg[] __read_mostly = { { .name = "CLASSIFY", .revision = 0, - .family = NFPROTO_UNSPEC, + .family = NFPROTO_IPV4, .hooks = (1 << NF_INET_LOCAL_OUT) | (1 << NF_INET_FORWARD) | - (1 << NF_INET_POST_ROUTING), + (1 << NF_INET_POST_ROUTING), .target = classify_tg, .targetsize = sizeof(struct xt_classify_target_info), .me = THIS_MODULE, @@ -54,6 +54,18 @@ static struct xt_target classify_tg_reg[] __read_mostly = { .targetsize = sizeof(struct xt_classify_target_info), .me = THIS_MODULE, }, +#if IS_ENABLED(CONFIG_IP6_NF_IPTABLES) + { + .name = "CLASSIFY", + .revision = 0, + .family = NFPROTO_IPV6, + .hooks = (1 << NF_INET_LOCAL_OUT) | (1 << NF_INET_FORWARD) | + (1 << NF_INET_POST_ROUTING), + .target = classify_tg, + .targetsize = sizeof(struct xt_classify_target_info), + .me = THIS_MODULE, + }, +#endif }; static int __init classify_tg_init(void) diff --git a/net/netfilter/xt_CONNSECMARK.c b/net/netfilter/xt_CONNSECMARK.c index 76acecf3e757a0..1494b3ee30e11e 100644 --- a/net/netfilter/xt_CONNSECMARK.c +++ b/net/netfilter/xt_CONNSECMARK.c @@ -114,25 +114,39 @@ static void connsecmark_tg_destroy(const struct xt_tgdtor_param *par) nf_ct_netns_put(par->net, par->family); } -static struct xt_target connsecmark_tg_reg __read_mostly = { - .name = "CONNSECMARK", - .revision = 0, - .family = NFPROTO_UNSPEC, - .checkentry = connsecmark_tg_check, - .destroy = connsecmark_tg_destroy, - .target = connsecmark_tg, - .targetsize = sizeof(struct xt_connsecmark_target_info), - .me = THIS_MODULE, +static struct xt_target connsecmark_tg_reg[] __read_mostly = { + { + .name = "CONNSECMARK", + .revision = 0, + .family = NFPROTO_IPV4, + .checkentry = connsecmark_tg_check, + .destroy = connsecmark_tg_destroy, + .target = connsecmark_tg, + .targetsize = sizeof(struct xt_connsecmark_target_info), + .me = THIS_MODULE, + }, +#if IS_ENABLED(CONFIG_IP6_NF_IPTABLES) + { + .name = "CONNSECMARK", + .revision = 0, + .family = NFPROTO_IPV6, + .checkentry = connsecmark_tg_check, + .destroy = connsecmark_tg_destroy, + .target = connsecmark_tg, + .targetsize = sizeof(struct xt_connsecmark_target_info), + .me = THIS_MODULE, + }, +#endif }; static int __init connsecmark_tg_init(void) { - return xt_register_target(&connsecmark_tg_reg); + return xt_register_targets(connsecmark_tg_reg, ARRAY_SIZE(connsecmark_tg_reg)); } static void __exit connsecmark_tg_exit(void) { - xt_unregister_target(&connsecmark_tg_reg); + xt_unregister_targets(connsecmark_tg_reg, ARRAY_SIZE(connsecmark_tg_reg)); } module_init(connsecmark_tg_init); diff --git a/net/netfilter/xt_CT.c b/net/netfilter/xt_CT.c index 2be2f7a7b60f4e..3ba94c34297cf5 100644 --- a/net/netfilter/xt_CT.c +++ b/net/netfilter/xt_CT.c @@ -313,10 +313,30 @@ static void xt_ct_tg_destroy_v1(const struct xt_tgdtor_param *par) xt_ct_tg_destroy(par, par->targinfo); } +static unsigned int +notrack_tg(struct sk_buff *skb, const struct xt_action_param *par) +{ + /* Previously seen (loopback)? Ignore. */ + if (skb->_nfct != 0) + return XT_CONTINUE; + + nf_ct_set(skb, NULL, IP_CT_UNTRACKED); + + return XT_CONTINUE; +} + static struct xt_target xt_ct_tg_reg[] __read_mostly = { + { + .name = "NOTRACK", + .revision = 0, + .family = NFPROTO_IPV4, + .target = notrack_tg, + .table = "raw", + .me = THIS_MODULE, + }, { .name = "CT", - .family = NFPROTO_UNSPEC, + .family = NFPROTO_IPV4, .targetsize = sizeof(struct xt_ct_target_info), .usersize = offsetof(struct xt_ct_target_info, ct), .checkentry = xt_ct_tg_check_v0, @@ -327,7 +347,7 @@ static struct xt_target xt_ct_tg_reg[] __read_mostly = { }, { .name = "CT", - .family = NFPROTO_UNSPEC, + .family = NFPROTO_IPV4, .revision = 1, .targetsize = sizeof(struct xt_ct_target_info_v1), .usersize = offsetof(struct xt_ct_target_info, ct), @@ -339,7 +359,7 @@ static struct xt_target xt_ct_tg_reg[] __read_mostly = { }, { .name = "CT", - .family = NFPROTO_UNSPEC, + .family = NFPROTO_IPV4, .revision = 2, .targetsize = sizeof(struct xt_ct_target_info_v1), .usersize = offsetof(struct xt_ct_target_info, ct), @@ -349,49 +369,61 @@ static struct xt_target xt_ct_tg_reg[] __read_mostly = { .table = "raw", .me = THIS_MODULE, }, -}; - -static unsigned int -notrack_tg(struct sk_buff *skb, const struct xt_action_param *par) -{ - /* Previously seen (loopback)? Ignore. */ - if (skb->_nfct != 0) - return XT_CONTINUE; - - nf_ct_set(skb, NULL, IP_CT_UNTRACKED); - - return XT_CONTINUE; -} - -static struct xt_target notrack_tg_reg __read_mostly = { - .name = "NOTRACK", - .revision = 0, - .family = NFPROTO_UNSPEC, - .target = notrack_tg, - .table = "raw", - .me = THIS_MODULE, +#if IS_ENABLED(CONFIG_IP6_NF_IPTABLES) + { + .name = "NOTRACK", + .revision = 0, + .family = NFPROTO_IPV6, + .target = notrack_tg, + .table = "raw", + .me = THIS_MODULE, + }, + { + .name = "CT", + .family = NFPROTO_IPV6, + .targetsize = sizeof(struct xt_ct_target_info), + .usersize = offsetof(struct xt_ct_target_info, ct), + .checkentry = xt_ct_tg_check_v0, + .destroy = xt_ct_tg_destroy_v0, + .target = xt_ct_target_v0, + .table = "raw", + .me = THIS_MODULE, + }, + { + .name = "CT", + .family = NFPROTO_IPV6, + .revision = 1, + .targetsize = sizeof(struct xt_ct_target_info_v1), + .usersize = offsetof(struct xt_ct_target_info, ct), + .checkentry = xt_ct_tg_check_v1, + .destroy = xt_ct_tg_destroy_v1, + .target = xt_ct_target_v1, + .table = "raw", + .me = THIS_MODULE, + }, + { + .name = "CT", + .family = NFPROTO_IPV6, + .revision = 2, + .targetsize = sizeof(struct xt_ct_target_info_v1), + .usersize = offsetof(struct xt_ct_target_info, ct), + .checkentry = xt_ct_tg_check_v2, + .destroy = xt_ct_tg_destroy_v1, + .target = xt_ct_target_v1, + .table = "raw", + .me = THIS_MODULE, + }, +#endif }; static int __init xt_ct_tg_init(void) { - int ret; - - ret = xt_register_target(¬rack_tg_reg); - if (ret < 0) - return ret; - - ret = xt_register_targets(xt_ct_tg_reg, ARRAY_SIZE(xt_ct_tg_reg)); - if (ret < 0) { - xt_unregister_target(¬rack_tg_reg); - return ret; - } - return 0; + return xt_register_targets(xt_ct_tg_reg, ARRAY_SIZE(xt_ct_tg_reg)); } static void __exit xt_ct_tg_exit(void) { xt_unregister_targets(xt_ct_tg_reg, ARRAY_SIZE(xt_ct_tg_reg)); - xt_unregister_target(¬rack_tg_reg); } module_init(xt_ct_tg_init); diff --git a/net/netfilter/xt_IDLETIMER.c b/net/netfilter/xt_IDLETIMER.c index db720efa811d58..f8b25b6f5da736 100644 --- a/net/netfilter/xt_IDLETIMER.c +++ b/net/netfilter/xt_IDLETIMER.c @@ -458,28 +458,49 @@ static void idletimer_tg_destroy_v1(const struct xt_tgdtor_param *par) static struct xt_target idletimer_tg[] __read_mostly = { { - .name = "IDLETIMER", - .family = NFPROTO_UNSPEC, - .target = idletimer_tg_target, - .targetsize = sizeof(struct idletimer_tg_info), - .usersize = offsetof(struct idletimer_tg_info, timer), - .checkentry = idletimer_tg_checkentry, - .destroy = idletimer_tg_destroy, - .me = THIS_MODULE, + .name = "IDLETIMER", + .family = NFPROTO_IPV4, + .target = idletimer_tg_target, + .targetsize = sizeof(struct idletimer_tg_info), + .usersize = offsetof(struct idletimer_tg_info, timer), + .checkentry = idletimer_tg_checkentry, + .destroy = idletimer_tg_destroy, + .me = THIS_MODULE, }, { - .name = "IDLETIMER", - .family = NFPROTO_UNSPEC, - .revision = 1, - .target = idletimer_tg_target_v1, - .targetsize = sizeof(struct idletimer_tg_info_v1), - .usersize = offsetof(struct idletimer_tg_info_v1, timer), - .checkentry = idletimer_tg_checkentry_v1, - .destroy = idletimer_tg_destroy_v1, - .me = THIS_MODULE, + .name = "IDLETIMER", + .family = NFPROTO_IPV4, + .revision = 1, + .target = idletimer_tg_target_v1, + .targetsize = sizeof(struct idletimer_tg_info_v1), + .usersize = offsetof(struct idletimer_tg_info_v1, timer), + .checkentry = idletimer_tg_checkentry_v1, + .destroy = idletimer_tg_destroy_v1, + .me = THIS_MODULE, }, - - +#if IS_ENABLED(CONFIG_IP6_NF_IPTABLES) + { + .name = "IDLETIMER", + .family = NFPROTO_IPV6, + .target = idletimer_tg_target, + .targetsize = sizeof(struct idletimer_tg_info), + .usersize = offsetof(struct idletimer_tg_info, timer), + .checkentry = idletimer_tg_checkentry, + .destroy = idletimer_tg_destroy, + .me = THIS_MODULE, + }, + { + .name = "IDLETIMER", + .family = NFPROTO_IPV6, + .revision = 1, + .target = idletimer_tg_target_v1, + .targetsize = sizeof(struct idletimer_tg_info_v1), + .usersize = offsetof(struct idletimer_tg_info_v1, timer), + .checkentry = idletimer_tg_checkentry_v1, + .destroy = idletimer_tg_destroy_v1, + .me = THIS_MODULE, + }, +#endif }; static struct class *idletimer_tg_class; diff --git a/net/netfilter/xt_LED.c b/net/netfilter/xt_LED.c index 36c9720ad8d6d4..f7b0286d106ac1 100644 --- a/net/netfilter/xt_LED.c +++ b/net/netfilter/xt_LED.c @@ -175,26 +175,41 @@ static void led_tg_destroy(const struct xt_tgdtor_param *par) kfree(ledinternal); } -static struct xt_target led_tg_reg __read_mostly = { - .name = "LED", - .revision = 0, - .family = NFPROTO_UNSPEC, - .target = led_tg, - .targetsize = sizeof(struct xt_led_info), - .usersize = offsetof(struct xt_led_info, internal_data), - .checkentry = led_tg_check, - .destroy = led_tg_destroy, - .me = THIS_MODULE, +static struct xt_target led_tg_reg[] __read_mostly = { + { + .name = "LED", + .revision = 0, + .family = NFPROTO_IPV4, + .target = led_tg, + .targetsize = sizeof(struct xt_led_info), + .usersize = offsetof(struct xt_led_info, internal_data), + .checkentry = led_tg_check, + .destroy = led_tg_destroy, + .me = THIS_MODULE, + }, +#if IS_ENABLED(CONFIG_IP6_NF_IPTABLES) + { + .name = "LED", + .revision = 0, + .family = NFPROTO_IPV6, + .target = led_tg, + .targetsize = sizeof(struct xt_led_info), + .usersize = offsetof(struct xt_led_info, internal_data), + .checkentry = led_tg_check, + .destroy = led_tg_destroy, + .me = THIS_MODULE, + }, +#endif }; static int __init led_tg_init(void) { - return xt_register_target(&led_tg_reg); + return xt_register_targets(led_tg_reg, ARRAY_SIZE(led_tg_reg)); } static void __exit led_tg_exit(void) { - xt_unregister_target(&led_tg_reg); + xt_unregister_targets(led_tg_reg, ARRAY_SIZE(led_tg_reg)); } module_init(led_tg_init); diff --git a/net/netfilter/xt_NFLOG.c b/net/netfilter/xt_NFLOG.c index e660c3710a1096..d80abd6ccaf8f7 100644 --- a/net/netfilter/xt_NFLOG.c +++ b/net/netfilter/xt_NFLOG.c @@ -64,25 +64,39 @@ static void nflog_tg_destroy(const struct xt_tgdtor_param *par) nf_logger_put(par->family, NF_LOG_TYPE_ULOG); } -static struct xt_target nflog_tg_reg __read_mostly = { - .name = "NFLOG", - .revision = 0, - .family = NFPROTO_UNSPEC, - .checkentry = nflog_tg_check, - .destroy = nflog_tg_destroy, - .target = nflog_tg, - .targetsize = sizeof(struct xt_nflog_info), - .me = THIS_MODULE, +static struct xt_target nflog_tg_reg[] __read_mostly = { + { + .name = "NFLOG", + .revision = 0, + .family = NFPROTO_IPV4, + .checkentry = nflog_tg_check, + .destroy = nflog_tg_destroy, + .target = nflog_tg, + .targetsize = sizeof(struct xt_nflog_info), + .me = THIS_MODULE, + }, +#if IS_ENABLED(CONFIG_IP6_NF_IPTABLES) + { + .name = "NFLOG", + .revision = 0, + .family = NFPROTO_IPV4, + .checkentry = nflog_tg_check, + .destroy = nflog_tg_destroy, + .target = nflog_tg, + .targetsize = sizeof(struct xt_nflog_info), + .me = THIS_MODULE, + }, +#endif }; static int __init nflog_tg_init(void) { - return xt_register_target(&nflog_tg_reg); + return xt_register_targets(nflog_tg_reg, ARRAY_SIZE(nflog_tg_reg)); } static void __exit nflog_tg_exit(void) { - xt_unregister_target(&nflog_tg_reg); + xt_unregister_targets(nflog_tg_reg, ARRAY_SIZE(nflog_tg_reg)); } module_init(nflog_tg_init); diff --git a/net/netfilter/xt_RATEEST.c b/net/netfilter/xt_RATEEST.c index 80f6624e23554b..4f49cfc2783120 100644 --- a/net/netfilter/xt_RATEEST.c +++ b/net/netfilter/xt_RATEEST.c @@ -179,16 +179,31 @@ static void xt_rateest_tg_destroy(const struct xt_tgdtor_param *par) xt_rateest_put(par->net, info->est); } -static struct xt_target xt_rateest_tg_reg __read_mostly = { - .name = "RATEEST", - .revision = 0, - .family = NFPROTO_UNSPEC, - .target = xt_rateest_tg, - .checkentry = xt_rateest_tg_checkentry, - .destroy = xt_rateest_tg_destroy, - .targetsize = sizeof(struct xt_rateest_target_info), - .usersize = offsetof(struct xt_rateest_target_info, est), - .me = THIS_MODULE, +static struct xt_target xt_rateest_tg_reg[] __read_mostly = { + { + .name = "RATEEST", + .revision = 0, + .family = NFPROTO_IPV4, + .target = xt_rateest_tg, + .checkentry = xt_rateest_tg_checkentry, + .destroy = xt_rateest_tg_destroy, + .targetsize = sizeof(struct xt_rateest_target_info), + .usersize = offsetof(struct xt_rateest_target_info, est), + .me = THIS_MODULE, + }, +#if IS_ENABLED(CONFIG_IP6_NF_IPTABLES) + { + .name = "RATEEST", + .revision = 0, + .family = NFPROTO_IPV6, + .target = xt_rateest_tg, + .checkentry = xt_rateest_tg_checkentry, + .destroy = xt_rateest_tg_destroy, + .targetsize = sizeof(struct xt_rateest_target_info), + .usersize = offsetof(struct xt_rateest_target_info, est), + .me = THIS_MODULE, + }, +#endif }; static __net_init int xt_rateest_net_init(struct net *net) @@ -214,12 +229,12 @@ static int __init xt_rateest_tg_init(void) if (err) return err; - return xt_register_target(&xt_rateest_tg_reg); + return xt_register_targets(xt_rateest_tg_reg, ARRAY_SIZE(xt_rateest_tg_reg)); } static void __exit xt_rateest_tg_fini(void) { - xt_unregister_target(&xt_rateest_tg_reg); + xt_unregister_targets(xt_rateest_tg_reg, ARRAY_SIZE(xt_rateest_tg_reg)); unregister_pernet_subsys(&xt_rateest_net_ops); } diff --git a/net/netfilter/xt_SECMARK.c b/net/netfilter/xt_SECMARK.c index 498a0bf6f0444a..5bc5ea505eb9e0 100644 --- a/net/netfilter/xt_SECMARK.c +++ b/net/netfilter/xt_SECMARK.c @@ -157,7 +157,7 @@ static struct xt_target secmark_tg_reg[] __read_mostly = { { .name = "SECMARK", .revision = 0, - .family = NFPROTO_UNSPEC, + .family = NFPROTO_IPV4, .checkentry = secmark_tg_check_v0, .destroy = secmark_tg_destroy, .target = secmark_tg_v0, @@ -167,7 +167,7 @@ static struct xt_target secmark_tg_reg[] __read_mostly = { { .name = "SECMARK", .revision = 1, - .family = NFPROTO_UNSPEC, + .family = NFPROTO_IPV4, .checkentry = secmark_tg_check_v1, .destroy = secmark_tg_destroy, .target = secmark_tg_v1, @@ -175,6 +175,29 @@ static struct xt_target secmark_tg_reg[] __read_mostly = { .usersize = offsetof(struct xt_secmark_target_info_v1, secid), .me = THIS_MODULE, }, +#if IS_ENABLED(CONFIG_IP6_NF_IPTABLES) + { + .name = "SECMARK", + .revision = 0, + .family = NFPROTO_IPV6, + .checkentry = secmark_tg_check_v0, + .destroy = secmark_tg_destroy, + .target = secmark_tg_v0, + .targetsize = sizeof(struct xt_secmark_target_info), + .me = THIS_MODULE, + }, + { + .name = "SECMARK", + .revision = 1, + .family = NFPROTO_IPV6, + .checkentry = secmark_tg_check_v1, + .destroy = secmark_tg_destroy, + .target = secmark_tg_v1, + .targetsize = sizeof(struct xt_secmark_target_info_v1), + .usersize = offsetof(struct xt_secmark_target_info_v1, secid), + .me = THIS_MODULE, + }, +#endif }; static int __init secmark_tg_init(void) diff --git a/net/netfilter/xt_TRACE.c b/net/netfilter/xt_TRACE.c index 5582dce98cae7d..f3fa4f11348cd8 100644 --- a/net/netfilter/xt_TRACE.c +++ b/net/netfilter/xt_TRACE.c @@ -29,25 +29,38 @@ trace_tg(struct sk_buff *skb, const struct xt_action_param *par) return XT_CONTINUE; } -static struct xt_target trace_tg_reg __read_mostly = { - .name = "TRACE", - .revision = 0, - .family = NFPROTO_UNSPEC, - .table = "raw", - .target = trace_tg, - .checkentry = trace_tg_check, - .destroy = trace_tg_destroy, - .me = THIS_MODULE, +static struct xt_target trace_tg_reg[] __read_mostly = { + { + .name = "TRACE", + .revision = 0, + .family = NFPROTO_IPV4, + .table = "raw", + .target = trace_tg, + .checkentry = trace_tg_check, + .destroy = trace_tg_destroy, + .me = THIS_MODULE, + }, +#if IS_ENABLED(CONFIG_IP6_NF_IPTABLES) + { + .name = "TRACE", + .revision = 0, + .family = NFPROTO_IPV6, + .table = "raw", + .target = trace_tg, + .checkentry = trace_tg_check, + .destroy = trace_tg_destroy, + }, +#endif }; static int __init trace_tg_init(void) { - return xt_register_target(&trace_tg_reg); + return xt_register_targets(trace_tg_reg, ARRAY_SIZE(trace_tg_reg)); } static void __exit trace_tg_exit(void) { - xt_unregister_target(&trace_tg_reg); + xt_unregister_targets(trace_tg_reg, ARRAY_SIZE(trace_tg_reg)); } module_init(trace_tg_init); diff --git a/net/netfilter/xt_addrtype.c b/net/netfilter/xt_addrtype.c index e9b2181e8c425f..a7708894310716 100644 --- a/net/netfilter/xt_addrtype.c +++ b/net/netfilter/xt_addrtype.c @@ -208,13 +208,24 @@ static struct xt_match addrtype_mt_reg[] __read_mostly = { }, { .name = "addrtype", - .family = NFPROTO_UNSPEC, + .family = NFPROTO_IPV4, .revision = 1, .match = addrtype_mt_v1, .checkentry = addrtype_mt_checkentry_v1, .matchsize = sizeof(struct xt_addrtype_info_v1), .me = THIS_MODULE - } + }, +#if IS_ENABLED(CONFIG_IP6_NF_IPTABLES) + { + .name = "addrtype", + .family = NFPROTO_IPV6, + .revision = 1, + .match = addrtype_mt_v1, + .checkentry = addrtype_mt_checkentry_v1, + .matchsize = sizeof(struct xt_addrtype_info_v1), + .me = THIS_MODULE + }, +#endif }; static int __init addrtype_mt_init(void) diff --git a/net/netfilter/xt_cluster.c b/net/netfilter/xt_cluster.c index a047a545371e18..908fd5f2c3c848 100644 --- a/net/netfilter/xt_cluster.c +++ b/net/netfilter/xt_cluster.c @@ -146,24 +146,37 @@ static void xt_cluster_mt_destroy(const struct xt_mtdtor_param *par) nf_ct_netns_put(par->net, par->family); } -static struct xt_match xt_cluster_match __read_mostly = { - .name = "cluster", - .family = NFPROTO_UNSPEC, - .match = xt_cluster_mt, - .checkentry = xt_cluster_mt_checkentry, - .matchsize = sizeof(struct xt_cluster_match_info), - .destroy = xt_cluster_mt_destroy, - .me = THIS_MODULE, +static struct xt_match xt_cluster_match[] __read_mostly = { + { + .name = "cluster", + .family = NFPROTO_IPV4, + .match = xt_cluster_mt, + .checkentry = xt_cluster_mt_checkentry, + .matchsize = sizeof(struct xt_cluster_match_info), + .destroy = xt_cluster_mt_destroy, + .me = THIS_MODULE, + }, +#if IS_ENABLED(CONFIG_IP6_NF_IPTABLES) + { + .name = "cluster", + .family = NFPROTO_IPV6, + .match = xt_cluster_mt, + .checkentry = xt_cluster_mt_checkentry, + .matchsize = sizeof(struct xt_cluster_match_info), + .destroy = xt_cluster_mt_destroy, + .me = THIS_MODULE, + }, +#endif }; static int __init xt_cluster_mt_init(void) { - return xt_register_match(&xt_cluster_match); + return xt_register_matches(xt_cluster_match, ARRAY_SIZE(xt_cluster_match)); } static void __exit xt_cluster_mt_fini(void) { - xt_unregister_match(&xt_cluster_match); + xt_unregister_matches(xt_cluster_match, ARRAY_SIZE(xt_cluster_match)); } MODULE_AUTHOR("Pablo Neira Ayuso "); diff --git a/net/netfilter/xt_connbytes.c b/net/netfilter/xt_connbytes.c index 93cb018c3055f8..2aabdcea870723 100644 --- a/net/netfilter/xt_connbytes.c +++ b/net/netfilter/xt_connbytes.c @@ -111,9 +111,11 @@ static int connbytes_mt_check(const struct xt_mtchk_param *par) return -EINVAL; ret = nf_ct_netns_get(par->net, par->family); - if (ret < 0) + if (ret < 0) { pr_info_ratelimited("cannot load conntrack support for proto=%u\n", par->family); + return ret; + } /* * This filter cannot function correctly unless connection tracking diff --git a/net/netfilter/xt_connlimit.c b/net/netfilter/xt_connlimit.c index 0e762277bcf8f9..0189f8b6b0bd15 100644 --- a/net/netfilter/xt_connlimit.c +++ b/net/netfilter/xt_connlimit.c @@ -117,26 +117,41 @@ static void connlimit_mt_destroy(const struct xt_mtdtor_param *par) nf_ct_netns_put(par->net, par->family); } -static struct xt_match connlimit_mt_reg __read_mostly = { - .name = "connlimit", - .revision = 1, - .family = NFPROTO_UNSPEC, - .checkentry = connlimit_mt_check, - .match = connlimit_mt, - .matchsize = sizeof(struct xt_connlimit_info), - .usersize = offsetof(struct xt_connlimit_info, data), - .destroy = connlimit_mt_destroy, - .me = THIS_MODULE, +static struct xt_match connlimit_mt_reg[] __read_mostly = { + { + .name = "connlimit", + .revision = 1, + .family = NFPROTO_IPV4, + .checkentry = connlimit_mt_check, + .match = connlimit_mt, + .matchsize = sizeof(struct xt_connlimit_info), + .usersize = offsetof(struct xt_connlimit_info, data), + .destroy = connlimit_mt_destroy, + .me = THIS_MODULE, + }, +#if IS_ENABLED(CONFIG_IP6_NF_IPTABLES) + { + .name = "connlimit", + .revision = 1, + .family = NFPROTO_IPV6, + .checkentry = connlimit_mt_check, + .match = connlimit_mt, + .matchsize = sizeof(struct xt_connlimit_info), + .usersize = offsetof(struct xt_connlimit_info, data), + .destroy = connlimit_mt_destroy, + .me = THIS_MODULE, + }, +#endif }; static int __init connlimit_mt_init(void) { - return xt_register_match(&connlimit_mt_reg); + return xt_register_matches(connlimit_mt_reg, ARRAY_SIZE(connlimit_mt_reg)); } static void __exit connlimit_mt_exit(void) { - xt_unregister_match(&connlimit_mt_reg); + xt_unregister_matches(connlimit_mt_reg, ARRAY_SIZE(connlimit_mt_reg)); } module_init(connlimit_mt_init); diff --git a/net/netfilter/xt_connmark.c b/net/netfilter/xt_connmark.c index ad3c033db64e70..4277084de2e70c 100644 --- a/net/netfilter/xt_connmark.c +++ b/net/netfilter/xt_connmark.c @@ -151,7 +151,7 @@ static struct xt_target connmark_tg_reg[] __read_mostly = { { .name = "CONNMARK", .revision = 1, - .family = NFPROTO_UNSPEC, + .family = NFPROTO_IPV4, .checkentry = connmark_tg_check, .target = connmark_tg, .targetsize = sizeof(struct xt_connmark_tginfo1), @@ -161,13 +161,35 @@ static struct xt_target connmark_tg_reg[] __read_mostly = { { .name = "CONNMARK", .revision = 2, - .family = NFPROTO_UNSPEC, + .family = NFPROTO_IPV4, .checkentry = connmark_tg_check, .target = connmark_tg_v2, .targetsize = sizeof(struct xt_connmark_tginfo2), .destroy = connmark_tg_destroy, .me = THIS_MODULE, - } + }, +#if IS_ENABLED(CONFIG_IP6_NF_IPTABLES) + { + .name = "CONNMARK", + .revision = 1, + .family = NFPROTO_IPV6, + .checkentry = connmark_tg_check, + .target = connmark_tg, + .targetsize = sizeof(struct xt_connmark_tginfo1), + .destroy = connmark_tg_destroy, + .me = THIS_MODULE, + }, + { + .name = "CONNMARK", + .revision = 2, + .family = NFPROTO_IPV6, + .checkentry = connmark_tg_check, + .target = connmark_tg_v2, + .targetsize = sizeof(struct xt_connmark_tginfo2), + .destroy = connmark_tg_destroy, + .me = THIS_MODULE, + }, +#endif }; static struct xt_match connmark_mt_reg __read_mostly = { diff --git a/net/netfilter/xt_mark.c b/net/netfilter/xt_mark.c index 1ad74b5920b533..f76fe04fc9a4e1 100644 --- a/net/netfilter/xt_mark.c +++ b/net/netfilter/xt_mark.c @@ -39,13 +39,35 @@ mark_mt(const struct sk_buff *skb, struct xt_action_param *par) return ((skb->mark & info->mask) == info->mark) ^ info->invert; } -static struct xt_target mark_tg_reg __read_mostly = { - .name = "MARK", - .revision = 2, - .family = NFPROTO_UNSPEC, - .target = mark_tg, - .targetsize = sizeof(struct xt_mark_tginfo2), - .me = THIS_MODULE, +static struct xt_target mark_tg_reg[] __read_mostly = { + { + .name = "MARK", + .revision = 2, + .family = NFPROTO_IPV4, + .target = mark_tg, + .targetsize = sizeof(struct xt_mark_tginfo2), + .me = THIS_MODULE, + }, +#if IS_ENABLED(CONFIG_IP_NF_ARPTABLES) + { + .name = "MARK", + .revision = 2, + .family = NFPROTO_ARP, + .target = mark_tg, + .targetsize = sizeof(struct xt_mark_tginfo2), + .me = THIS_MODULE, + }, +#endif +#if IS_ENABLED(CONFIG_IP6_NF_IPTABLES) + { + .name = "MARK", + .revision = 2, + .family = NFPROTO_IPV4, + .target = mark_tg, + .targetsize = sizeof(struct xt_mark_tginfo2), + .me = THIS_MODULE, + }, +#endif }; static struct xt_match mark_mt_reg __read_mostly = { @@ -61,12 +83,12 @@ static int __init mark_mt_init(void) { int ret; - ret = xt_register_target(&mark_tg_reg); + ret = xt_register_targets(mark_tg_reg, ARRAY_SIZE(mark_tg_reg)); if (ret < 0) return ret; ret = xt_register_match(&mark_mt_reg); if (ret < 0) { - xt_unregister_target(&mark_tg_reg); + xt_unregister_targets(mark_tg_reg, ARRAY_SIZE(mark_tg_reg)); return ret; } return 0; @@ -75,7 +97,7 @@ static int __init mark_mt_init(void) static void __exit mark_mt_exit(void) { xt_unregister_match(&mark_mt_reg); - xt_unregister_target(&mark_tg_reg); + xt_unregister_targets(mark_tg_reg, ARRAY_SIZE(mark_tg_reg)); } module_init(mark_mt_init); From 05ef7055debc804e8083737402127975e7244fc4 Mon Sep 17 00:00:00 2001 From: Florian Westphal Date: Wed, 9 Oct 2024 09:19:02 +0200 Subject: [PATCH 64/87] netfilter: fib: check correct rtable in vrf setups We need to init l3mdev unconditionally, else main routing table is searched and incorrect result is returned unless strict (iif keyword) matching is requested. Next patch adds a selftest for this. Fixes: 2a8a7c0eaa87 ("netfilter: nft_fib: Fix for rpath check with VRF devices") Closes: https://bugzilla.netfilter.org/show_bug.cgi?id=1761 Signed-off-by: Florian Westphal Signed-off-by: Pablo Neira Ayuso --- net/ipv4/netfilter/nft_fib_ipv4.c | 4 +--- net/ipv6/netfilter/nft_fib_ipv6.c | 5 +++-- 2 files changed, 4 insertions(+), 5 deletions(-) diff --git a/net/ipv4/netfilter/nft_fib_ipv4.c b/net/ipv4/netfilter/nft_fib_ipv4.c index 00da1332bbf1a6..09fff5d424efc2 100644 --- a/net/ipv4/netfilter/nft_fib_ipv4.c +++ b/net/ipv4/netfilter/nft_fib_ipv4.c @@ -65,6 +65,7 @@ void nft_fib4_eval(const struct nft_expr *expr, struct nft_regs *regs, .flowi4_scope = RT_SCOPE_UNIVERSE, .flowi4_iif = LOOPBACK_IFINDEX, .flowi4_uid = sock_net_uid(nft_net(pkt), NULL), + .flowi4_l3mdev = l3mdev_master_ifindex_rcu(nft_in(pkt)), }; const struct net_device *oif; const struct net_device *found; @@ -83,9 +84,6 @@ void nft_fib4_eval(const struct nft_expr *expr, struct nft_regs *regs, else oif = NULL; - if (priv->flags & NFTA_FIB_F_IIF) - fl4.flowi4_l3mdev = l3mdev_master_ifindex_rcu(oif); - if (nft_hook(pkt) == NF_INET_PRE_ROUTING && nft_fib_is_loopback(pkt->skb, nft_in(pkt))) { nft_fib_store_result(dest, priv, nft_in(pkt)); diff --git a/net/ipv6/netfilter/nft_fib_ipv6.c b/net/ipv6/netfilter/nft_fib_ipv6.c index 36dc14b34388c8..c9f1634b3838ae 100644 --- a/net/ipv6/netfilter/nft_fib_ipv6.c +++ b/net/ipv6/netfilter/nft_fib_ipv6.c @@ -41,8 +41,6 @@ static int nft_fib6_flowi_init(struct flowi6 *fl6, const struct nft_fib *priv, if (ipv6_addr_type(&fl6->daddr) & IPV6_ADDR_LINKLOCAL) { lookup_flags |= RT6_LOOKUP_F_IFACE; fl6->flowi6_oif = get_ifindex(dev ? dev : pkt->skb->dev); - } else if (priv->flags & NFTA_FIB_F_IIF) { - fl6->flowi6_l3mdev = l3mdev_master_ifindex_rcu(dev); } if (ipv6_addr_type(&fl6->saddr) & IPV6_ADDR_UNICAST) @@ -75,6 +73,8 @@ static u32 __nft_fib6_eval_type(const struct nft_fib *priv, else if (priv->flags & NFTA_FIB_F_OIF) dev = nft_out(pkt); + fl6.flowi6_l3mdev = l3mdev_master_ifindex_rcu(dev); + nft_fib6_flowi_init(&fl6, priv, pkt, dev, iph); if (dev && nf_ipv6_chk_addr(nft_net(pkt), &fl6.daddr, dev, true)) @@ -165,6 +165,7 @@ void nft_fib6_eval(const struct nft_expr *expr, struct nft_regs *regs, .flowi6_iif = LOOPBACK_IFINDEX, .flowi6_proto = pkt->tprot, .flowi6_uid = sock_net_uid(nft_net(pkt), NULL), + .flowi6_l3mdev = l3mdev_master_ifindex_rcu(nft_in(pkt)), }; struct rt6_info *rt; int lookup_flags; From c6a0862bee696cfb236a4e160a7f376c0ecdcf0c Mon Sep 17 00:00:00 2001 From: Florian Westphal Date: Wed, 9 Oct 2024 09:19:03 +0200 Subject: [PATCH 65/87] selftests: netfilter: conntrack_vrf.sh: add fib test case meta iifname veth0 ip daddr ... fib daddr oif ... is expected to return "dummy0" interface which is part of same vrf as veth0. Signed-off-by: Florian Westphal Signed-off-by: Pablo Neira Ayuso --- .../selftests/net/netfilter/conntrack_vrf.sh | 33 +++++++++++++++++++ 1 file changed, 33 insertions(+) diff --git a/tools/testing/selftests/net/netfilter/conntrack_vrf.sh b/tools/testing/selftests/net/netfilter/conntrack_vrf.sh index 073e8e62d350b2..e95ecb37c2b145 100755 --- a/tools/testing/selftests/net/netfilter/conntrack_vrf.sh +++ b/tools/testing/selftests/net/netfilter/conntrack_vrf.sh @@ -32,6 +32,7 @@ source lib.sh IP0=172.30.30.1 IP1=172.30.30.2 +DUMMYNET=10.9.9 PFXL=30 ret=0 @@ -54,6 +55,7 @@ setup_ns ns0 ns1 ip netns exec "$ns0" sysctl -q -w net.ipv4.conf.default.rp_filter=0 ip netns exec "$ns0" sysctl -q -w net.ipv4.conf.all.rp_filter=0 ip netns exec "$ns0" sysctl -q -w net.ipv4.conf.all.rp_filter=0 +ip netns exec "$ns0" sysctl -q -w net.ipv4.conf.all.forwarding=1 if ! ip link add veth0 netns "$ns0" type veth peer name veth0 netns "$ns1" > /dev/null 2>&1; then echo "SKIP: Could not add veth device" @@ -65,13 +67,18 @@ if ! ip -net "$ns0" li add tvrf type vrf table 9876; then exit $ksft_skip fi +ip -net "$ns0" link add dummy0 type dummy + ip -net "$ns0" li set veth0 master tvrf +ip -net "$ns0" li set dummy0 master tvrf ip -net "$ns0" li set tvrf up ip -net "$ns0" li set veth0 up +ip -net "$ns0" li set dummy0 up ip -net "$ns1" li set veth0 up ip -net "$ns0" addr add $IP0/$PFXL dev veth0 ip -net "$ns1" addr add $IP1/$PFXL dev veth0 +ip -net "$ns0" addr add $DUMMYNET.1/$PFXL dev dummy0 listener_ready() { @@ -212,9 +219,35 @@ EOF fi } +test_fib() +{ +ip netns exec "$ns0" nft -f - < /dev/null + + if ip netns exec "$ns0" nft list counter t fibcount | grep -q "packets 1"; then + echo "PASS: fib lookup returned exepected output interface" + else + echo "FAIL: fib lookup did not return exepected output interface" + ret=1 + return + fi +} + test_ct_zone_in test_masquerade_vrf "default" test_masquerade_vrf "pfifo" test_masquerade_veth +test_fib exit $ret From 70a0da8c113555fe14bb6db8e5180f8fc2c18385 Mon Sep 17 00:00:00 2001 From: Jacky Chou Date: Mon, 7 Oct 2024 11:24:35 +0800 Subject: [PATCH 66/87] net: ftgmac100: fixed not check status from fixed phy Add error handling from calling fixed_phy_register. It may return some error, therefore, need to check the status. And fixed_phy_register needs to bind a device node for mdio. Add the mac device node for fixed_phy_register function. This is a reference to this function, of_phy_register_fixed_link(). Fixes: e24a6c874601 ("net: ftgmac100: Get link speed and duplex for NC-SI") Signed-off-by: Jacky Chou Link: https://patch.msgid.link/20241007032435.787892-1-jacky_chou@aspeedtech.com Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/faraday/ftgmac100.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/faraday/ftgmac100.c b/drivers/net/ethernet/faraday/ftgmac100.c index f3cc14cc757d27..0b61f548fd188f 100644 --- a/drivers/net/ethernet/faraday/ftgmac100.c +++ b/drivers/net/ethernet/faraday/ftgmac100.c @@ -1906,7 +1906,12 @@ static int ftgmac100_probe(struct platform_device *pdev) goto err_phy_connect; } - phydev = fixed_phy_register(PHY_POLL, &ncsi_phy_status, NULL); + phydev = fixed_phy_register(PHY_POLL, &ncsi_phy_status, np); + if (IS_ERR(phydev)) { + dev_err(&pdev->dev, "failed to register fixed PHY device\n"); + err = PTR_ERR(phydev); + goto err_phy_connect; + } err = phy_connect_direct(netdev, phydev, ftgmac100_adjust_link, PHY_INTERFACE_MODE_MII); if (err) { From 080ddc22f3b0a58500f87e8e865aabbf96495eea Mon Sep 17 00:00:00 2001 From: Rosen Penev Date: Tue, 8 Oct 2024 16:30:50 -0700 Subject: [PATCH 67/87] net: ibm: emac: mal: add dcr_unmap to _remove It's done in probe so it should be undone here. Fixes: 1d3bb996481e ("Device tree aware EMAC driver") Signed-off-by: Rosen Penev Reviewed-by: Breno Leitao Link: https://patch.msgid.link/20241008233050.9422-1-rosenp@gmail.com Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/ibm/emac/mal.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/net/ethernet/ibm/emac/mal.c b/drivers/net/ethernet/ibm/emac/mal.c index 0c5e22d14372ac..99d5f83f7c60b5 100644 --- a/drivers/net/ethernet/ibm/emac/mal.c +++ b/drivers/net/ethernet/ibm/emac/mal.c @@ -742,6 +742,8 @@ static void mal_remove(struct platform_device *ofdev) free_netdev(mal->dummy_dev); + dcr_unmap(mal->dcr_host, 0x100); + dma_free_coherent(&ofdev->dev, sizeof(struct mal_descriptor) * (NUM_TX_BUFF * mal->num_tx_chans + From 6be063071a457767ee229db13f019c2ec03bfe44 Mon Sep 17 00:00:00 2001 From: Wei Fang Date: Tue, 8 Oct 2024 14:11:53 +0800 Subject: [PATCH 68/87] net: fec: don't save PTP state if PTP is unsupported MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Some platforms (such as i.MX25 and i.MX27) do not support PTP, so on these platforms fec_ptp_init() is not called and the related members in fep are not initialized. However, fec_ptp_save_state() is called unconditionally, which causes the kernel to panic. Therefore, add a condition so that fec_ptp_save_state() is not called if PTP is not supported. Fixes: a1477dc87dc4 ("net: fec: Restart PPS after link state change") Reported-by: Guenter Roeck Closes: https://lore.kernel.org/lkml/353e41fe-6bb4-4ee9-9980-2da2a9c1c508@roeck-us.net/ Signed-off-by: Wei Fang Reviewed-by: Csókás, Bence Reviewed-by: Simon Horman Tested-by: Guenter Roeck Link: https://patch.msgid.link/20241008061153.1977930-1-wei.fang@nxp.com Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/freescale/fec_main.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c index 31ebf6a4f973bd..9d9fcec41488e3 100644 --- a/drivers/net/ethernet/freescale/fec_main.c +++ b/drivers/net/ethernet/freescale/fec_main.c @@ -1077,7 +1077,8 @@ fec_restart(struct net_device *ndev) u32 rcntl = OPT_FRAME_SIZE | 0x04; u32 ecntl = FEC_ECR_ETHEREN; - fec_ptp_save_state(fep); + if (fep->bufdesc_ex) + fec_ptp_save_state(fep); /* Whack a reset. We should wait for this. * For i.MX6SX SOC, enet use AXI bus, we use disable MAC @@ -1340,7 +1341,8 @@ fec_stop(struct net_device *ndev) netdev_err(ndev, "Graceful transmit stop did not complete!\n"); } - fec_ptp_save_state(fep); + if (fep->bufdesc_ex) + fec_ptp_save_state(fep); /* Whack a reset. We should wait for this. * For i.MX6SX SOC, enet use AXI bus, we use disable MAC From 8c924369cb56c3054dca504c2c9c3eb208272865 Mon Sep 17 00:00:00 2001 From: Vladimir Oltean Date: Tue, 8 Oct 2024 12:43:20 +0300 Subject: [PATCH 69/87] net: dsa: refuse cross-chip mirroring operations In case of a tc mirred action from one switch to another, the behavior is not correct. We simply tell the source switch driver to program a mirroring entry towards mirror->to_local_port = to_dp->index, but it is not even guaranteed that the to_dp belongs to the same switch as dp. For proper cross-chip support, we would need to go through the cross-chip notifier layer in switch.c, program the entry on cascade ports, and introduce new, explicit API for cross-chip mirroring, given that intermediary switches should have introspection into the DSA tags passed through the cascade port (and not just program a port mirror on the entire cascade port). None of that exists today. Reject what is not implemented so that user space is not misled into thinking it works. Fixes: f50f212749e8 ("net: dsa: Add plumbing for port mirroring") Signed-off-by: Vladimir Oltean Reviewed-by: Andrew Lunn Link: https://patch.msgid.link/20241008094320.3340980-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski --- net/dsa/user.c | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/net/dsa/user.c b/net/dsa/user.c index 74eda9b30608e6..64f660d2334b77 100644 --- a/net/dsa/user.c +++ b/net/dsa/user.c @@ -1392,6 +1392,14 @@ dsa_user_add_cls_matchall_mirred(struct net_device *dev, if (!dsa_user_dev_check(act->dev)) return -EOPNOTSUPP; + to_dp = dsa_user_to_port(act->dev); + + if (dp->ds != to_dp->ds) { + NL_SET_ERR_MSG_MOD(extack, + "Cross-chip mirroring not implemented"); + return -EOPNOTSUPP; + } + mall_tc_entry = kzalloc(sizeof(*mall_tc_entry), GFP_KERNEL); if (!mall_tc_entry) return -ENOMEM; @@ -1399,9 +1407,6 @@ dsa_user_add_cls_matchall_mirred(struct net_device *dev, mall_tc_entry->cookie = cls->cookie; mall_tc_entry->type = DSA_PORT_MALL_MIRROR; mirror = &mall_tc_entry->mirror; - - to_dp = dsa_user_to_port(act->dev); - mirror->to_local_port = to_dp->index; mirror->ingress = ingress; From d94785bb46b6167382b1de3290eccc91fa98df53 Mon Sep 17 00:00:00 2001 From: Breno Leitao Date: Tue, 8 Oct 2024 02:43:24 -0700 Subject: [PATCH 70/87] net: netconsole: fix wrong warning A warning is triggered when there is insufficient space in the buffer for userdata. However, this is not an issue since userdata will be sent in the next iteration. Current warning message: ------------[ cut here ]------------ WARNING: CPU: 13 PID: 3013042 at drivers/net/netconsole.c:1122 write_ext_msg+0x3b6/0x3d0 ? write_ext_msg+0x3b6/0x3d0 console_flush_all+0x1e9/0x330 The code incorrectly issues a warning when this_chunk is zero, which is a valid scenario. The warning should only be triggered when this_chunk is negative. Fixes: 1ec9daf95093 ("net: netconsole: append userdata to fragmented netconsole messages") Signed-off-by: Breno Leitao Reviewed-by: Simon Horman Link: https://patch.msgid.link/20241008094325.896208-1-leitao@debian.org Signed-off-by: Jakub Kicinski --- drivers/net/netconsole.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/drivers/net/netconsole.c b/drivers/net/netconsole.c index 01cf33fa75036c..de20928f740211 100644 --- a/drivers/net/netconsole.c +++ b/drivers/net/netconsole.c @@ -1161,8 +1161,14 @@ static void send_ext_msg_udp(struct netconsole_target *nt, const char *msg, this_chunk = min(userdata_len - sent_userdata, MAX_PRINT_CHUNK - preceding_bytes); - if (WARN_ON_ONCE(this_chunk <= 0)) + if (WARN_ON_ONCE(this_chunk < 0)) + /* this_chunk could be zero if all the previous + * message used all the buffer. This is not a + * problem, userdata will be sent in the next + * iteration + */ return; + memcpy(buf + this_header + this_offset, userdata + sent_userdata, this_chunk); From e32d262c89e2b22cb0640223f953b548617ed8a6 Mon Sep 17 00:00:00 2001 From: Paolo Abeni Date: Tue, 8 Oct 2024 13:04:52 +0200 Subject: [PATCH 71/87] mptcp: handle consistently DSS corruption Bugged peer implementation can send corrupted DSS options, consistently hitting a few warning in the data path. Use DEBUG_NET assertions, to avoid the splat on some builds and handle consistently the error, dumping related MIBs and performing fallback and/or reset according to the subflow type. Fixes: 6771bfd9ee24 ("mptcp: update mptcp ack sequence from work queue") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni Reviewed-by: Matthieu Baerts (NGI0) Signed-off-by: Matthieu Baerts (NGI0) Link: https://patch.msgid.link/20241008-net-mptcp-fallback-fixes-v1-1-c6fb8e93e551@kernel.org Signed-off-by: Jakub Kicinski --- net/mptcp/mib.c | 2 ++ net/mptcp/mib.h | 2 ++ net/mptcp/protocol.c | 24 +++++++++++++++++++++--- net/mptcp/subflow.c | 4 +++- 4 files changed, 28 insertions(+), 4 deletions(-) diff --git a/net/mptcp/mib.c b/net/mptcp/mib.c index 38c2efc82b948d..ad88bd3c58dffe 100644 --- a/net/mptcp/mib.c +++ b/net/mptcp/mib.c @@ -32,6 +32,8 @@ static const struct snmp_mib mptcp_snmp_list[] = { SNMP_MIB_ITEM("MPJoinSynTxBindErr", MPTCP_MIB_JOINSYNTXBINDERR), SNMP_MIB_ITEM("MPJoinSynTxConnectErr", MPTCP_MIB_JOINSYNTXCONNECTERR), SNMP_MIB_ITEM("DSSNotMatching", MPTCP_MIB_DSSNOMATCH), + SNMP_MIB_ITEM("DSSCorruptionFallback", MPTCP_MIB_DSSCORRUPTIONFALLBACK), + SNMP_MIB_ITEM("DSSCorruptionReset", MPTCP_MIB_DSSCORRUPTIONRESET), SNMP_MIB_ITEM("InfiniteMapTx", MPTCP_MIB_INFINITEMAPTX), SNMP_MIB_ITEM("InfiniteMapRx", MPTCP_MIB_INFINITEMAPRX), SNMP_MIB_ITEM("DSSNoMatchTCP", MPTCP_MIB_DSSTCPMISMATCH), diff --git a/net/mptcp/mib.h b/net/mptcp/mib.h index c8ffe18a872217..3206cdda8bb106 100644 --- a/net/mptcp/mib.h +++ b/net/mptcp/mib.h @@ -27,6 +27,8 @@ enum linux_mptcp_mib_field { MPTCP_MIB_JOINSYNTXBINDERR, /* Not able to bind() the address when sending a SYN + MP_JOIN */ MPTCP_MIB_JOINSYNTXCONNECTERR, /* Not able to connect() when sending a SYN + MP_JOIN */ MPTCP_MIB_DSSNOMATCH, /* Received a new mapping that did not match the previous one */ + MPTCP_MIB_DSSCORRUPTIONFALLBACK,/* DSS corruption detected, fallback */ + MPTCP_MIB_DSSCORRUPTIONRESET, /* DSS corruption detected, MPJ subflow reset */ MPTCP_MIB_INFINITEMAPTX, /* Sent an infinite mapping */ MPTCP_MIB_INFINITEMAPRX, /* Received an infinite mapping */ MPTCP_MIB_DSSTCPMISMATCH, /* DSS-mapping did not map with TCP's sequence numbers */ diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index c2317919fc148a..6d0e201c3eb26a 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -620,6 +620,18 @@ static bool mptcp_check_data_fin(struct sock *sk) return ret; } +static void mptcp_dss_corruption(struct mptcp_sock *msk, struct sock *ssk) +{ + if (READ_ONCE(msk->allow_infinite_fallback)) { + MPTCP_INC_STATS(sock_net(ssk), + MPTCP_MIB_DSSCORRUPTIONFALLBACK); + mptcp_do_fallback(ssk); + } else { + MPTCP_INC_STATS(sock_net(ssk), MPTCP_MIB_DSSCORRUPTIONRESET); + mptcp_subflow_reset(ssk); + } +} + static bool __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk, struct sock *ssk, unsigned int *bytes) @@ -692,10 +704,16 @@ static bool __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk, moved += len; seq += len; - if (WARN_ON_ONCE(map_remaining < len)) - break; + if (unlikely(map_remaining < len)) { + DEBUG_NET_WARN_ON_ONCE(1); + mptcp_dss_corruption(msk, ssk); + } } else { - WARN_ON_ONCE(!fin); + if (unlikely(!fin)) { + DEBUG_NET_WARN_ON_ONCE(1); + mptcp_dss_corruption(msk, ssk); + } + sk_eat_skb(ssk, skb); done = true; } diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c index 1040b3b9696b74..e1046a696ab5c7 100644 --- a/net/mptcp/subflow.c +++ b/net/mptcp/subflow.c @@ -975,8 +975,10 @@ static bool skb_is_fully_mapped(struct sock *ssk, struct sk_buff *skb) unsigned int skb_consumed; skb_consumed = tcp_sk(ssk)->copied_seq - TCP_SKB_CB(skb)->seq; - if (WARN_ON_ONCE(skb_consumed >= skb->len)) + if (unlikely(skb_consumed >= skb->len)) { + DEBUG_NET_WARN_ON_ONCE(1); return true; + } return skb->len - skb_consumed <= subflow->map_data_len - mptcp_subflow_get_map_offset(subflow); From 4dabcdf581217e60690467a37c956a5b8dbc6bd9 Mon Sep 17 00:00:00 2001 From: Paolo Abeni Date: Tue, 8 Oct 2024 13:04:53 +0200 Subject: [PATCH 72/87] tcp: fix mptcp DSS corruption due to large pmtu xmit Syzkaller was able to trigger a DSS corruption: TCP: request_sock_subflow_v4: Possible SYN flooding on port [::]:20002. Sending cookies. ------------[ cut here ]------------ WARNING: CPU: 0 PID: 5227 at net/mptcp/protocol.c:695 __mptcp_move_skbs_from_subflow+0x20a9/0x21f0 net/mptcp/protocol.c:695 Modules linked in: CPU: 0 UID: 0 PID: 5227 Comm: syz-executor350 Not tainted 6.11.0-syzkaller-08829-gaf9c191ac2a0 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/06/2024 RIP: 0010:__mptcp_move_skbs_from_subflow+0x20a9/0x21f0 net/mptcp/protocol.c:695 Code: 0f b6 dc 31 ff 89 de e8 b5 dd ea f5 89 d8 48 81 c4 50 01 00 00 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc e8 98 da ea f5 90 <0f> 0b 90 e9 47 ff ff ff e8 8a da ea f5 90 0f 0b 90 e9 99 e0 ff ff RSP: 0018:ffffc90000006db8 EFLAGS: 00010246 RAX: ffffffff8ba9df18 RBX: 00000000000055f0 RCX: ffff888030023c00 RDX: 0000000000000100 RSI: 00000000000081e5 RDI: 00000000000055f0 RBP: 1ffff110062bf1ae R08: ffffffff8ba9cf12 R09: 1ffff110062bf1b8 R10: dffffc0000000000 R11: ffffed10062bf1b9 R12: 0000000000000000 R13: dffffc0000000000 R14: 00000000700cec61 R15: 00000000000081e5 FS: 000055556679c380(0000) GS:ffff8880b8600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020287000 CR3: 0000000077892000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: move_skbs_to_msk net/mptcp/protocol.c:811 [inline] mptcp_data_ready+0x29c/0xa90 net/mptcp/protocol.c:854 subflow_data_ready+0x34a/0x920 net/mptcp/subflow.c:1490 tcp_data_queue+0x20fd/0x76c0 net/ipv4/tcp_input.c:5283 tcp_rcv_established+0xfba/0x2020 net/ipv4/tcp_input.c:6237 tcp_v4_do_rcv+0x96d/0xc70 net/ipv4/tcp_ipv4.c:1915 tcp_v4_rcv+0x2dc0/0x37f0 net/ipv4/tcp_ipv4.c:2350 ip_protocol_deliver_rcu+0x22e/0x440 net/ipv4/ip_input.c:205 ip_local_deliver_finish+0x341/0x5f0 net/ipv4/ip_input.c:233 NF_HOOK+0x3a4/0x450 include/linux/netfilter.h:314 NF_HOOK+0x3a4/0x450 include/linux/netfilter.h:314 __netif_receive_skb_one_core net/core/dev.c:5662 [inline] __netif_receive_skb+0x2bf/0x650 net/core/dev.c:5775 process_backlog+0x662/0x15b0 net/core/dev.c:6107 __napi_poll+0xcb/0x490 net/core/dev.c:6771 napi_poll net/core/dev.c:6840 [inline] net_rx_action+0x89b/0x1240 net/core/dev.c:6962 handle_softirqs+0x2c5/0x980 kernel/softirq.c:554 do_softirq+0x11b/0x1e0 kernel/softirq.c:455 __local_bh_enable_ip+0x1bb/0x200 kernel/softirq.c:382 local_bh_enable include/linux/bottom_half.h:33 [inline] rcu_read_unlock_bh include/linux/rcupdate.h:919 [inline] __dev_queue_xmit+0x1764/0x3e80 net/core/dev.c:4451 dev_queue_xmit include/linux/netdevice.h:3094 [inline] neigh_hh_output include/net/neighbour.h:526 [inline] neigh_output include/net/neighbour.h:540 [inline] ip_finish_output2+0xd41/0x1390 net/ipv4/ip_output.c:236 ip_local_out net/ipv4/ip_output.c:130 [inline] __ip_queue_xmit+0x118c/0x1b80 net/ipv4/ip_output.c:536 __tcp_transmit_skb+0x2544/0x3b30 net/ipv4/tcp_output.c:1466 tcp_transmit_skb net/ipv4/tcp_output.c:1484 [inline] tcp_mtu_probe net/ipv4/tcp_output.c:2547 [inline] tcp_write_xmit+0x641d/0x6bf0 net/ipv4/tcp_output.c:2752 __tcp_push_pending_frames+0x9b/0x360 net/ipv4/tcp_output.c:3015 tcp_push_pending_frames include/net/tcp.h:2107 [inline] tcp_data_snd_check net/ipv4/tcp_input.c:5714 [inline] tcp_rcv_established+0x1026/0x2020 net/ipv4/tcp_input.c:6239 tcp_v4_do_rcv+0x96d/0xc70 net/ipv4/tcp_ipv4.c:1915 sk_backlog_rcv include/net/sock.h:1113 [inline] __release_sock+0x214/0x350 net/core/sock.c:3072 release_sock+0x61/0x1f0 net/core/sock.c:3626 mptcp_push_release net/mptcp/protocol.c:1486 [inline] __mptcp_push_pending+0x6b5/0x9f0 net/mptcp/protocol.c:1625 mptcp_sendmsg+0x10bb/0x1b10 net/mptcp/protocol.c:1903 sock_sendmsg_nosec net/socket.c:730 [inline] __sock_sendmsg+0x1a6/0x270 net/socket.c:745 ____sys_sendmsg+0x52a/0x7e0 net/socket.c:2603 ___sys_sendmsg net/socket.c:2657 [inline] __sys_sendmsg+0x2aa/0x390 net/socket.c:2686 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7fb06e9317f9 Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007ffe2cfd4f98 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00007fb06e97f468 RCX: 00007fb06e9317f9 RDX: 0000000000000000 RSI: 0000000020000080 RDI: 0000000000000005 RBP: 00007fb06e97f446 R08: 0000555500000000 R09: 0000555500000000 R10: 0000555500000000 R11: 0000000000000246 R12: 00007fb06e97f406 R13: 0000000000000001 R14: 00007ffe2cfd4fe0 R15: 0000000000000003 Additionally syzkaller provided a nice reproducer. The repro enables pmtu on the loopback device, leading to tcp_mtu_probe() generating very large probe packets. tcp_can_coalesce_send_queue_head() currently does not check for mptcp-level invariants, and allowed the creation of cross-DSS probes, leading to the mentioned corruption. Address the issue teaching tcp_can_coalesce_send_queue_head() about mptcp using the tcp_skb_can_collapse(), also reducing the code duplication. Fixes: 85712484110d ("tcp: coalesce/collapse must respect MPTCP extensions") Cc: stable@vger.kernel.org Reported-by: syzbot+d1bff73460e33101f0e7@syzkaller.appspotmail.com Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/513 Signed-off-by: Paolo Abeni Acked-by: Matthieu Baerts (NGI0) Signed-off-by: Matthieu Baerts (NGI0) Link: https://patch.msgid.link/20241008-net-mptcp-fallback-fixes-v1-2-c6fb8e93e551@kernel.org Signed-off-by: Jakub Kicinski --- net/ipv4/tcp_output.c | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 4fd746bd4d54f6..68804fd01dafc4 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -2342,10 +2342,7 @@ static bool tcp_can_coalesce_send_queue_head(struct sock *sk, int len) if (len <= skb->len) break; - if (unlikely(TCP_SKB_CB(skb)->eor) || - tcp_has_tx_tstamp(skb) || - !skb_pure_zcopy_same(skb, next) || - skb_frags_readable(skb) != skb_frags_readable(next)) + if (tcp_has_tx_tstamp(skb) || !tcp_skb_can_collapse(skb, next)) return false; len -= skb->len; From 119d51e225febc8152476340a880f5415a01e99e Mon Sep 17 00:00:00 2001 From: "Matthieu Baerts (NGI0)" Date: Tue, 8 Oct 2024 13:04:54 +0200 Subject: [PATCH 73/87] mptcp: fallback when MPTCP opts are dropped after 1st data As reported by Christoph [1], before this patch, an MPTCP connection was wrongly reset when a host received a first data packet with MPTCP options after the 3wHS, but got the next ones without. According to the MPTCP v1 specs [2], a fallback should happen in this case, because the host didn't receive a DATA_ACK from the other peer, nor receive data for more than the initial window which implies a DATA_ACK being received by the other peer. The patch here re-uses the same logic as the one used in other places: by looking at allow_infinite_fallback, which is disabled at the creation of an additional subflow. It's not looking at the first DATA_ACK (or implying one received from the other side) as suggested by the RFC, but it is in continuation with what was already done, which is safer, and it fixes the reported issue. The next step, looking at this first DATA_ACK, is tracked in [4]. This patch has been validated using the following Packetdrill script: 0 socket(..., SOCK_STREAM, IPPROTO_MPTCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 // 3WHS is OK +0.0 < S 0:0(0) win 65535 +0.0 > S. 0:0(0) ack 1 +0.1 < . 1:1(0) ack 1 win 2048 +0 accept(3, ..., ...) = 4 // Data from the client with valid MPTCP options (no DATA_ACK: normal) +0.1 < P. 1:501(500) ack 1 win 2048 // From here, the MPTCP options will be dropped by a middlebox +0.0 > . 1:1(0) ack 501 +0.1 read(4, ..., 500) = 500 +0 write(4, ..., 100) = 100 // The server replies with data, still thinking MPTCP is being used +0.0 > P. 1:101(100) ack 501 // But the client already did a fallback to TCP, because the two previous packets have been received without MPTCP options +0.1 < . 501:501(0) ack 101 win 2048 +0.0 < P. 501:601(100) ack 101 win 2048 // The server should fallback to TCP, not reset: it didn't get a DATA_ACK, nor data for more than the initial window +0.0 > . 101:101(0) ack 601 Note that this script requires Packetdrill with MPTCP support, see [3]. Fixes: dea2b1ea9c70 ("mptcp: do not reset MP_CAPABLE subflow on mapping errors") Cc: stable@vger.kernel.org Reported-by: Christoph Paasch Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/518 [1] Link: https://datatracker.ietf.org/doc/html/rfc8684#name-fallback [2] Link: https://github.com/multipath-tcp/packetdrill [3] Link: https://github.com/multipath-tcp/mptcp_net-next/issues/519 [4] Reviewed-by: Paolo Abeni Signed-off-by: Matthieu Baerts (NGI0) Link: https://patch.msgid.link/20241008-net-mptcp-fallback-fixes-v1-3-c6fb8e93e551@kernel.org Signed-off-by: Jakub Kicinski --- net/mptcp/subflow.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c index e1046a696ab5c7..25dde81bcb7575 100644 --- a/net/mptcp/subflow.c +++ b/net/mptcp/subflow.c @@ -1282,7 +1282,7 @@ static bool subflow_can_fallback(struct mptcp_subflow_context *subflow) else if (READ_ONCE(msk->csum_enabled)) return !subflow->valid_csum_seen; else - return !subflow->fully_established; + return READ_ONCE(msk->allow_infinite_fallback); } static void mptcp_subflow_fail(struct mptcp_sock *msk, struct sock *ssk) From db0a37b7ac27d8ca27d3dc676a16d081c16ec7b9 Mon Sep 17 00:00:00 2001 From: "Matthieu Baerts (NGI0)" Date: Tue, 8 Oct 2024 13:04:55 +0200 Subject: [PATCH 74/87] mptcp: pm: do not remove closing subflows In a previous fix, the in-kernel path-manager has been modified not to retrigger the removal of a subflow if it was already closed, e.g. when the initial subflow is removed, but kept in the subflows list. To be complete, this fix should also skip the subflows that are in any closing state: mptcp_close_ssk() will initiate the closure, but the switch to the TCP_CLOSE state depends on the other peer. Fixes: 58e1b66b4e4b ("mptcp: pm: do not remove already closed subflows") Cc: stable@vger.kernel.org Suggested-by: Paolo Abeni Acked-by: Paolo Abeni Signed-off-by: Matthieu Baerts (NGI0) Link: https://patch.msgid.link/20241008-net-mptcp-fallback-fixes-v1-4-c6fb8e93e551@kernel.org Signed-off-by: Jakub Kicinski --- net/mptcp/pm_netlink.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/net/mptcp/pm_netlink.c b/net/mptcp/pm_netlink.c index 64fe0e7d87d732..f6f0a38a0750f8 100644 --- a/net/mptcp/pm_netlink.c +++ b/net/mptcp/pm_netlink.c @@ -860,7 +860,8 @@ static void mptcp_pm_nl_rm_addr_or_subflow(struct mptcp_sock *msk, int how = RCV_SHUTDOWN | SEND_SHUTDOWN; u8 id = subflow_get_local_id(subflow); - if (inet_sk_state_load(ssk) == TCP_CLOSE) + if ((1 << inet_sk_state_load(ssk)) & + (TCPF_FIN_WAIT1 | TCPF_FIN_WAIT2 | TCPF_CLOSING | TCPF_CLOSE)) continue; if (rm_type == MPTCP_MIB_RMADDR && remote_id != rm_id) continue; From ac888d58869bb99753e7652be19a151df9ecb35d Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Tue, 8 Oct 2024 14:31:10 +0000 Subject: [PATCH 75/87] net: do not delay dst_entries_add() in dst_release() dst_entries_add() uses per-cpu data that might be freed at netns dismantle from ip6_route_net_exit() calling dst_entries_destroy() Before ip6_route_net_exit() can be called, we release all the dsts associated with this netns, via calls to dst_release(), which waits an rcu grace period before calling dst_destroy() dst_entries_add() use in dst_destroy() is racy, because dst_entries_destroy() could have been called already. Decrementing the number of dsts must happen sooner. Notes: 1) in CONFIG_XFRM case, dst_destroy() can call dst_release_immediate(child), this might also cause UAF if the child does not have DST_NOCOUNT set. IPSEC maintainers might take a look and see how to address this. 2) There is also discussion about removing this count of dst, which might happen in future kernels. Fixes: f88649721268 ("ipv4: fix dst race in sk_dst_get()") Closes: https://lore.kernel.org/lkml/CANn89iLCCGsP7SFn9HKpvnKu96Td4KD08xf7aGtiYgZnkjaL=w@mail.gmail.com/T/ Reported-by: Naresh Kamboju Tested-by: Linux Kernel Functional Testing Tested-by: Naresh Kamboju Signed-off-by: Eric Dumazet Cc: Xin Long Cc: Steffen Klassert Reviewed-by: Xin Long Link: https://patch.msgid.link/20241008143110.1064899-1-edumazet@google.com Signed-off-by: Paolo Abeni --- net/core/dst.c | 17 ++++++++++++----- 1 file changed, 12 insertions(+), 5 deletions(-) diff --git a/net/core/dst.c b/net/core/dst.c index 95f533844f17f1..9552a90d4772dc 100644 --- a/net/core/dst.c +++ b/net/core/dst.c @@ -109,9 +109,6 @@ static void dst_destroy(struct dst_entry *dst) child = xdst->child; } #endif - if (!(dst->flags & DST_NOCOUNT)) - dst_entries_add(dst->ops, -1); - if (dst->ops->destroy) dst->ops->destroy(dst); netdev_put(dst->dev, &dst->dev_tracker); @@ -159,17 +156,27 @@ void dst_dev_put(struct dst_entry *dst) } EXPORT_SYMBOL(dst_dev_put); +static void dst_count_dec(struct dst_entry *dst) +{ + if (!(dst->flags & DST_NOCOUNT)) + dst_entries_add(dst->ops, -1); +} + void dst_release(struct dst_entry *dst) { - if (dst && rcuref_put(&dst->__rcuref)) + if (dst && rcuref_put(&dst->__rcuref)) { + dst_count_dec(dst); call_rcu_hurry(&dst->rcu_head, dst_destroy_rcu); + } } EXPORT_SYMBOL(dst_release); void dst_release_immediate(struct dst_entry *dst) { - if (dst && rcuref_put(&dst->__rcuref)) + if (dst && rcuref_put(&dst->__rcuref)) { + dst_count_dec(dst); dst_destroy(dst); + } } EXPORT_SYMBOL(dst_release_immediate); From 07cc7b0b942bf55ef1a471470ecda8d2a6a6541f Mon Sep 17 00:00:00 2001 From: Kuniyuki Iwashima Date: Tue, 8 Oct 2024 11:47:32 -0700 Subject: [PATCH 76/87] rtnetlink: Add bulk registration helpers for rtnetlink message handlers. Before commit addf9b90de22 ("net: rtnetlink: use rcu to free rtnl message handlers"), once rtnl_msg_handlers[protocol] was allocated, the following rtnl_register_module() for the same protocol never failed. However, after the commit, rtnl_msg_handler[protocol][msgtype] needs to be allocated in each rtnl_register_module(), so each call could fail. Many callers of rtnl_register_module() do not handle the returned error, and we need to add many error handlings. To handle that easily, let's add wrapper functions for bulk registration of rtnetlink message handlers. Signed-off-by: Kuniyuki Iwashima Signed-off-by: Paolo Abeni --- include/net/rtnetlink.h | 17 +++++++++++++++++ net/core/rtnetlink.c | 29 +++++++++++++++++++++++++++++ 2 files changed, 46 insertions(+) diff --git a/include/net/rtnetlink.h b/include/net/rtnetlink.h index b45d57b5968af4..2d3eb7cb4dfff0 100644 --- a/include/net/rtnetlink.h +++ b/include/net/rtnetlink.h @@ -29,6 +29,15 @@ static inline enum rtnl_kinds rtnl_msgtype_kind(int msgtype) return msgtype & RTNL_KIND_MASK; } +struct rtnl_msg_handler { + struct module *owner; + int protocol; + int msgtype; + rtnl_doit_func doit; + rtnl_dumpit_func dumpit; + int flags; +}; + void rtnl_register(int protocol, int msgtype, rtnl_doit_func, rtnl_dumpit_func, unsigned int flags); int rtnl_register_module(struct module *owner, int protocol, int msgtype, @@ -36,6 +45,14 @@ int rtnl_register_module(struct module *owner, int protocol, int msgtype, int rtnl_unregister(int protocol, int msgtype); void rtnl_unregister_all(int protocol); +int __rtnl_register_many(const struct rtnl_msg_handler *handlers, int n); +void __rtnl_unregister_many(const struct rtnl_msg_handler *handlers, int n); + +#define rtnl_register_many(handlers) \ + __rtnl_register_many(handlers, ARRAY_SIZE(handlers)) +#define rtnl_unregister_many(handlers) \ + __rtnl_unregister_many(handlers, ARRAY_SIZE(handlers)) + static inline int rtnl_msg_family(const struct nlmsghdr *nlh) { if (nlmsg_len(nlh) >= sizeof(struct rtgenmsg)) diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c index f0a52098708584..e30e7ea0207d0c 100644 --- a/net/core/rtnetlink.c +++ b/net/core/rtnetlink.c @@ -384,6 +384,35 @@ void rtnl_unregister_all(int protocol) } EXPORT_SYMBOL_GPL(rtnl_unregister_all); +int __rtnl_register_many(const struct rtnl_msg_handler *handlers, int n) +{ + const struct rtnl_msg_handler *handler; + int i, err; + + for (i = 0, handler = handlers; i < n; i++, handler++) { + err = rtnl_register_internal(handler->owner, handler->protocol, + handler->msgtype, handler->doit, + handler->dumpit, handler->flags); + if (err) { + __rtnl_unregister_many(handlers, i); + break; + } + } + + return err; +} +EXPORT_SYMBOL_GPL(__rtnl_register_many); + +void __rtnl_unregister_many(const struct rtnl_msg_handler *handlers, int n) +{ + const struct rtnl_msg_handler *handler; + int i; + + for (i = n - 1, handler = handlers + n - 1; i >= 0; i--, handler--) + rtnl_unregister(handler->protocol, handler->msgtype); +} +EXPORT_SYMBOL_GPL(__rtnl_unregister_many); + static LIST_HEAD(link_ops); static const struct rtnl_link_ops *rtnl_link_ops_get(const char *kind) From 78b7b991838a4a6baeaad934addc4db2c5917eb8 Mon Sep 17 00:00:00 2001 From: Kuniyuki Iwashima Date: Tue, 8 Oct 2024 11:47:33 -0700 Subject: [PATCH 77/87] vxlan: Handle error of rtnl_register_module(). Since introduced, vxlan_vnifilter_init() has been ignoring the returned value of rtnl_register_module(), which could fail silently. Handling the error allows users to view a module as an all-or-nothing thing in terms of the rtnetlink functionality. This prevents syzkaller from reporting spurious errors from its tests, where OOM often occurs and module is automatically loaded. Let's handle the errors by rtnl_register_many(). Fixes: f9c4bb0b245c ("vxlan: vni filtering support on collect metadata device") Signed-off-by: Kuniyuki Iwashima Reviewed-by: Nikolay Aleksandrov Signed-off-by: Paolo Abeni --- drivers/net/vxlan/vxlan_core.c | 6 +++++- drivers/net/vxlan/vxlan_private.h | 2 +- drivers/net/vxlan/vxlan_vnifilter.c | 19 +++++++++---------- 3 files changed, 15 insertions(+), 12 deletions(-) diff --git a/drivers/net/vxlan/vxlan_core.c b/drivers/net/vxlan/vxlan_core.c index 53dcb9fffc04fe..6e9a3795846aa3 100644 --- a/drivers/net/vxlan/vxlan_core.c +++ b/drivers/net/vxlan/vxlan_core.c @@ -4913,9 +4913,13 @@ static int __init vxlan_init_module(void) if (rc) goto out4; - vxlan_vnifilter_init(); + rc = vxlan_vnifilter_init(); + if (rc) + goto out5; return 0; +out5: + rtnl_link_unregister(&vxlan_link_ops); out4: unregister_switchdev_notifier(&vxlan_switchdev_notifier_block); out3: diff --git a/drivers/net/vxlan/vxlan_private.h b/drivers/net/vxlan/vxlan_private.h index b35d96b7884378..76a351a997d510 100644 --- a/drivers/net/vxlan/vxlan_private.h +++ b/drivers/net/vxlan/vxlan_private.h @@ -202,7 +202,7 @@ int vxlan_vni_in_use(struct net *src_net, struct vxlan_dev *vxlan, int vxlan_vnigroup_init(struct vxlan_dev *vxlan); void vxlan_vnigroup_uninit(struct vxlan_dev *vxlan); -void vxlan_vnifilter_init(void); +int vxlan_vnifilter_init(void); void vxlan_vnifilter_uninit(void); void vxlan_vnifilter_count(struct vxlan_dev *vxlan, __be32 vni, struct vxlan_vni_node *vninode, diff --git a/drivers/net/vxlan/vxlan_vnifilter.c b/drivers/net/vxlan/vxlan_vnifilter.c index 9c59d0bf8c3de0..d2023e7131bd4f 100644 --- a/drivers/net/vxlan/vxlan_vnifilter.c +++ b/drivers/net/vxlan/vxlan_vnifilter.c @@ -992,19 +992,18 @@ static int vxlan_vnifilter_process(struct sk_buff *skb, struct nlmsghdr *nlh, return err; } -void vxlan_vnifilter_init(void) +static const struct rtnl_msg_handler vxlan_vnifilter_rtnl_msg_handlers[] = { + {THIS_MODULE, PF_BRIDGE, RTM_GETTUNNEL, NULL, vxlan_vnifilter_dump, 0}, + {THIS_MODULE, PF_BRIDGE, RTM_NEWTUNNEL, vxlan_vnifilter_process, NULL, 0}, + {THIS_MODULE, PF_BRIDGE, RTM_DELTUNNEL, vxlan_vnifilter_process, NULL, 0}, +}; + +int vxlan_vnifilter_init(void) { - rtnl_register_module(THIS_MODULE, PF_BRIDGE, RTM_GETTUNNEL, NULL, - vxlan_vnifilter_dump, 0); - rtnl_register_module(THIS_MODULE, PF_BRIDGE, RTM_NEWTUNNEL, - vxlan_vnifilter_process, NULL, 0); - rtnl_register_module(THIS_MODULE, PF_BRIDGE, RTM_DELTUNNEL, - vxlan_vnifilter_process, NULL, 0); + return rtnl_register_many(vxlan_vnifilter_rtnl_msg_handlers); } void vxlan_vnifilter_uninit(void) { - rtnl_unregister(PF_BRIDGE, RTM_GETTUNNEL); - rtnl_unregister(PF_BRIDGE, RTM_NEWTUNNEL); - rtnl_unregister(PF_BRIDGE, RTM_DELTUNNEL); + rtnl_unregister_many(vxlan_vnifilter_rtnl_msg_handlers); } From cba5e43b0b757734b1e79f624d93a71435e31136 Mon Sep 17 00:00:00 2001 From: Kuniyuki Iwashima Date: Tue, 8 Oct 2024 11:47:34 -0700 Subject: [PATCH 78/87] bridge: Handle error of rtnl_register_module(). Since introduced, br_vlan_rtnl_init() has been ignoring the returned value of rtnl_register_module(), which could fail silently. Handling the error allows users to view a module as an all-or-nothing thing in terms of the rtnetlink functionality. This prevents syzkaller from reporting spurious errors from its tests, where OOM often occurs and module is automatically loaded. Let's handle the errors by rtnl_register_many(). Fixes: 8dcea187088b ("net: bridge: vlan: add rtm definitions and dump support") Fixes: f26b296585dc ("net: bridge: vlan: add new rtm message support") Fixes: adb3ce9bcb0f ("net: bridge: vlan: add del rtm message support") Signed-off-by: Kuniyuki Iwashima Acked-by: Nikolay Aleksandrov Signed-off-by: Paolo Abeni --- net/bridge/br_netlink.c | 6 +++++- net/bridge/br_private.h | 5 +++-- net/bridge/br_vlan.c | 19 +++++++++---------- 3 files changed, 17 insertions(+), 13 deletions(-) diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c index f17dbac7d82843..6b97ae47f85524 100644 --- a/net/bridge/br_netlink.c +++ b/net/bridge/br_netlink.c @@ -1920,7 +1920,10 @@ int __init br_netlink_init(void) { int err; - br_vlan_rtnl_init(); + err = br_vlan_rtnl_init(); + if (err) + goto out; + rtnl_af_register(&br_af_ops); err = rtnl_link_register(&br_link_ops); @@ -1931,6 +1934,7 @@ int __init br_netlink_init(void) out_af: rtnl_af_unregister(&br_af_ops); +out: return err; } diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h index d4bedc87b1d8f1..041f6e571a2097 100644 --- a/net/bridge/br_private.h +++ b/net/bridge/br_private.h @@ -1571,7 +1571,7 @@ void br_vlan_get_stats(const struct net_bridge_vlan *v, void br_vlan_port_event(struct net_bridge_port *p, unsigned long event); int br_vlan_bridge_event(struct net_device *dev, unsigned long event, void *ptr); -void br_vlan_rtnl_init(void); +int br_vlan_rtnl_init(void); void br_vlan_rtnl_uninit(void); void br_vlan_notify(const struct net_bridge *br, const struct net_bridge_port *p, @@ -1802,8 +1802,9 @@ static inline int br_vlan_bridge_event(struct net_device *dev, return 0; } -static inline void br_vlan_rtnl_init(void) +static inline int br_vlan_rtnl_init(void) { + return 0; } static inline void br_vlan_rtnl_uninit(void) diff --git a/net/bridge/br_vlan.c b/net/bridge/br_vlan.c index 9c2fffb827ab19..89f51ea4cabece 100644 --- a/net/bridge/br_vlan.c +++ b/net/bridge/br_vlan.c @@ -2296,19 +2296,18 @@ static int br_vlan_rtm_process(struct sk_buff *skb, struct nlmsghdr *nlh, return err; } -void br_vlan_rtnl_init(void) +static const struct rtnl_msg_handler br_vlan_rtnl_msg_handlers[] = { + {THIS_MODULE, PF_BRIDGE, RTM_NEWVLAN, br_vlan_rtm_process, NULL, 0}, + {THIS_MODULE, PF_BRIDGE, RTM_DELVLAN, br_vlan_rtm_process, NULL, 0}, + {THIS_MODULE, PF_BRIDGE, RTM_GETVLAN, NULL, br_vlan_rtm_dump, 0}, +}; + +int br_vlan_rtnl_init(void) { - rtnl_register_module(THIS_MODULE, PF_BRIDGE, RTM_GETVLAN, NULL, - br_vlan_rtm_dump, 0); - rtnl_register_module(THIS_MODULE, PF_BRIDGE, RTM_NEWVLAN, - br_vlan_rtm_process, NULL, 0); - rtnl_register_module(THIS_MODULE, PF_BRIDGE, RTM_DELVLAN, - br_vlan_rtm_process, NULL, 0); + return rtnl_register_many(br_vlan_rtnl_msg_handlers); } void br_vlan_rtnl_uninit(void) { - rtnl_unregister(PF_BRIDGE, RTM_GETVLAN); - rtnl_unregister(PF_BRIDGE, RTM_NEWVLAN); - rtnl_unregister(PF_BRIDGE, RTM_DELVLAN); + rtnl_unregister_many(br_vlan_rtnl_msg_handlers); } From d51705614f668254cc5def7490df76f9680b4659 Mon Sep 17 00:00:00 2001 From: Kuniyuki Iwashima Date: Tue, 8 Oct 2024 11:47:35 -0700 Subject: [PATCH 79/87] mctp: Handle error of rtnl_register_module(). Since introduced, mctp has been ignoring the returned value of rtnl_register_module(), which could fail silently. Handling the error allows users to view a module as an all-or-nothing thing in terms of the rtnetlink functionality. This prevents syzkaller from reporting spurious errors from its tests, where OOM often occurs and module is automatically loaded. Let's handle the errors by rtnl_register_many(). Fixes: 583be982d934 ("mctp: Add device handling and netlink interface") Fixes: 831119f88781 ("mctp: Add neighbour netlink interface") Fixes: 06d2f4c583a7 ("mctp: Add netlink route management") Signed-off-by: Kuniyuki Iwashima Reviewed-by: Jeremy Kerr Signed-off-by: Paolo Abeni --- include/net/mctp.h | 2 +- net/mctp/af_mctp.c | 6 +++++- net/mctp/device.c | 30 ++++++++++++++++++------------ net/mctp/neigh.c | 31 +++++++++++++++++++------------ net/mctp/route.c | 33 +++++++++++++++++++++++---------- 5 files changed, 66 insertions(+), 36 deletions(-) diff --git a/include/net/mctp.h b/include/net/mctp.h index 7b17c52e8ce2a4..28d59ae94ca3b4 100644 --- a/include/net/mctp.h +++ b/include/net/mctp.h @@ -295,7 +295,7 @@ void mctp_neigh_remove_dev(struct mctp_dev *mdev); int mctp_routes_init(void); void mctp_routes_exit(void); -void mctp_device_init(void); +int mctp_device_init(void); void mctp_device_exit(void); #endif /* __NET_MCTP_H */ diff --git a/net/mctp/af_mctp.c b/net/mctp/af_mctp.c index 43288b408fde3b..f6de136008f6f9 100644 --- a/net/mctp/af_mctp.c +++ b/net/mctp/af_mctp.c @@ -756,10 +756,14 @@ static __init int mctp_init(void) if (rc) goto err_unreg_routes; - mctp_device_init(); + rc = mctp_device_init(); + if (rc) + goto err_unreg_neigh; return 0; +err_unreg_neigh: + mctp_neigh_exit(); err_unreg_routes: mctp_routes_exit(); err_unreg_proto: diff --git a/net/mctp/device.c b/net/mctp/device.c index acb97b25742896..85cc5f31f1e7c0 100644 --- a/net/mctp/device.c +++ b/net/mctp/device.c @@ -524,25 +524,31 @@ static struct notifier_block mctp_dev_nb = { .priority = ADDRCONF_NOTIFY_PRIORITY, }; -void __init mctp_device_init(void) +static const struct rtnl_msg_handler mctp_device_rtnl_msg_handlers[] = { + {THIS_MODULE, PF_MCTP, RTM_NEWADDR, mctp_rtm_newaddr, NULL, 0}, + {THIS_MODULE, PF_MCTP, RTM_DELADDR, mctp_rtm_deladdr, NULL, 0}, + {THIS_MODULE, PF_MCTP, RTM_GETADDR, NULL, mctp_dump_addrinfo, 0}, +}; + +int __init mctp_device_init(void) { - register_netdevice_notifier(&mctp_dev_nb); + int err; - rtnl_register_module(THIS_MODULE, PF_MCTP, RTM_GETADDR, - NULL, mctp_dump_addrinfo, 0); - rtnl_register_module(THIS_MODULE, PF_MCTP, RTM_NEWADDR, - mctp_rtm_newaddr, NULL, 0); - rtnl_register_module(THIS_MODULE, PF_MCTP, RTM_DELADDR, - mctp_rtm_deladdr, NULL, 0); + register_netdevice_notifier(&mctp_dev_nb); rtnl_af_register(&mctp_af_ops); + + err = rtnl_register_many(mctp_device_rtnl_msg_handlers); + if (err) { + rtnl_af_unregister(&mctp_af_ops); + unregister_netdevice_notifier(&mctp_dev_nb); + } + + return err; } void __exit mctp_device_exit(void) { + rtnl_unregister_many(mctp_device_rtnl_msg_handlers); rtnl_af_unregister(&mctp_af_ops); - rtnl_unregister(PF_MCTP, RTM_DELADDR); - rtnl_unregister(PF_MCTP, RTM_NEWADDR); - rtnl_unregister(PF_MCTP, RTM_GETADDR); - unregister_netdevice_notifier(&mctp_dev_nb); } diff --git a/net/mctp/neigh.c b/net/mctp/neigh.c index ffa0f9e0983fba..590f642413e4ef 100644 --- a/net/mctp/neigh.c +++ b/net/mctp/neigh.c @@ -322,22 +322,29 @@ static struct pernet_operations mctp_net_ops = { .exit = mctp_neigh_net_exit, }; +static const struct rtnl_msg_handler mctp_neigh_rtnl_msg_handlers[] = { + {THIS_MODULE, PF_MCTP, RTM_NEWNEIGH, mctp_rtm_newneigh, NULL, 0}, + {THIS_MODULE, PF_MCTP, RTM_DELNEIGH, mctp_rtm_delneigh, NULL, 0}, + {THIS_MODULE, PF_MCTP, RTM_GETNEIGH, NULL, mctp_rtm_getneigh, 0}, +}; + int __init mctp_neigh_init(void) { - rtnl_register_module(THIS_MODULE, PF_MCTP, RTM_NEWNEIGH, - mctp_rtm_newneigh, NULL, 0); - rtnl_register_module(THIS_MODULE, PF_MCTP, RTM_DELNEIGH, - mctp_rtm_delneigh, NULL, 0); - rtnl_register_module(THIS_MODULE, PF_MCTP, RTM_GETNEIGH, - NULL, mctp_rtm_getneigh, 0); - - return register_pernet_subsys(&mctp_net_ops); + int err; + + err = register_pernet_subsys(&mctp_net_ops); + if (err) + return err; + + err = rtnl_register_many(mctp_neigh_rtnl_msg_handlers); + if (err) + unregister_pernet_subsys(&mctp_net_ops); + + return err; } -void __exit mctp_neigh_exit(void) +void mctp_neigh_exit(void) { + rtnl_unregister_many(mctp_neigh_rtnl_msg_handlers); unregister_pernet_subsys(&mctp_net_ops); - rtnl_unregister(PF_MCTP, RTM_GETNEIGH); - rtnl_unregister(PF_MCTP, RTM_DELNEIGH); - rtnl_unregister(PF_MCTP, RTM_NEWNEIGH); } diff --git a/net/mctp/route.c b/net/mctp/route.c index eefd7834d9a009..597e9cf5aa6444 100644 --- a/net/mctp/route.c +++ b/net/mctp/route.c @@ -1474,26 +1474,39 @@ static struct pernet_operations mctp_net_ops = { .exit = mctp_routes_net_exit, }; +static const struct rtnl_msg_handler mctp_route_rtnl_msg_handlers[] = { + {THIS_MODULE, PF_MCTP, RTM_NEWROUTE, mctp_newroute, NULL, 0}, + {THIS_MODULE, PF_MCTP, RTM_DELROUTE, mctp_delroute, NULL, 0}, + {THIS_MODULE, PF_MCTP, RTM_GETROUTE, NULL, mctp_dump_rtinfo, 0}, +}; + int __init mctp_routes_init(void) { + int err; + dev_add_pack(&mctp_packet_type); - rtnl_register_module(THIS_MODULE, PF_MCTP, RTM_GETROUTE, - NULL, mctp_dump_rtinfo, 0); - rtnl_register_module(THIS_MODULE, PF_MCTP, RTM_NEWROUTE, - mctp_newroute, NULL, 0); - rtnl_register_module(THIS_MODULE, PF_MCTP, RTM_DELROUTE, - mctp_delroute, NULL, 0); + err = register_pernet_subsys(&mctp_net_ops); + if (err) + goto err_pernet; + + err = rtnl_register_many(mctp_route_rtnl_msg_handlers); + if (err) + goto err_rtnl; - return register_pernet_subsys(&mctp_net_ops); + return 0; + +err_rtnl: + unregister_pernet_subsys(&mctp_net_ops); +err_pernet: + dev_remove_pack(&mctp_packet_type); + return err; } void mctp_routes_exit(void) { + rtnl_unregister_many(mctp_route_rtnl_msg_handlers); unregister_pernet_subsys(&mctp_net_ops); - rtnl_unregister(PF_MCTP, RTM_DELROUTE); - rtnl_unregister(PF_MCTP, RTM_NEWROUTE); - rtnl_unregister(PF_MCTP, RTM_GETROUTE); dev_remove_pack(&mctp_packet_type); } From 5be2062e3080e3ff6707816caa445ec0c6eaacf7 Mon Sep 17 00:00:00 2001 From: Kuniyuki Iwashima Date: Tue, 8 Oct 2024 11:47:36 -0700 Subject: [PATCH 80/87] mpls: Handle error of rtnl_register_module(). Since introduced, mpls_init() has been ignoring the returned value of rtnl_register_module(), which could fail silently. Handling the error allows users to view a module as an all-or-nothing thing in terms of the rtnetlink functionality. This prevents syzkaller from reporting spurious errors from its tests, where OOM often occurs and module is automatically loaded. Let's handle the errors by rtnl_register_many(). Fixes: 03c0566542f4 ("mpls: Netlink commands to add, remove, and dump routes") Signed-off-by: Kuniyuki Iwashima Signed-off-by: Paolo Abeni --- net/mpls/af_mpls.c | 32 +++++++++++++++++++++----------- 1 file changed, 21 insertions(+), 11 deletions(-) diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c index aba983531ed323..df62638b649843 100644 --- a/net/mpls/af_mpls.c +++ b/net/mpls/af_mpls.c @@ -2728,6 +2728,15 @@ static struct rtnl_af_ops mpls_af_ops __read_mostly = { .get_stats_af_size = mpls_get_stats_af_size, }; +static const struct rtnl_msg_handler mpls_rtnl_msg_handlers[] __initdata_or_module = { + {THIS_MODULE, PF_MPLS, RTM_NEWROUTE, mpls_rtm_newroute, NULL, 0}, + {THIS_MODULE, PF_MPLS, RTM_DELROUTE, mpls_rtm_delroute, NULL, 0}, + {THIS_MODULE, PF_MPLS, RTM_GETROUTE, mpls_getroute, mpls_dump_routes, 0}, + {THIS_MODULE, PF_MPLS, RTM_GETNETCONF, + mpls_netconf_get_devconf, mpls_netconf_dump_devconf, + RTNL_FLAG_DUMP_UNLOCKED}, +}; + static int __init mpls_init(void) { int err; @@ -2746,24 +2755,25 @@ static int __init mpls_init(void) rtnl_af_register(&mpls_af_ops); - rtnl_register_module(THIS_MODULE, PF_MPLS, RTM_NEWROUTE, - mpls_rtm_newroute, NULL, 0); - rtnl_register_module(THIS_MODULE, PF_MPLS, RTM_DELROUTE, - mpls_rtm_delroute, NULL, 0); - rtnl_register_module(THIS_MODULE, PF_MPLS, RTM_GETROUTE, - mpls_getroute, mpls_dump_routes, 0); - rtnl_register_module(THIS_MODULE, PF_MPLS, RTM_GETNETCONF, - mpls_netconf_get_devconf, - mpls_netconf_dump_devconf, - RTNL_FLAG_DUMP_UNLOCKED); - err = ipgre_tunnel_encap_add_mpls_ops(); + err = rtnl_register_many(mpls_rtnl_msg_handlers); if (err) + goto out_unregister_rtnl_af; + + err = ipgre_tunnel_encap_add_mpls_ops(); + if (err) { pr_err("Can't add mpls over gre tunnel ops\n"); + goto out_unregister_rtnl; + } err = 0; out: return err; +out_unregister_rtnl: + rtnl_unregister_many(mpls_rtnl_msg_handlers); +out_unregister_rtnl_af: + rtnl_af_unregister(&mpls_af_ops); + dev_remove_pack(&mpls_packet_type); out_unregister_pernet: unregister_pernet_subsys(&mpls_net_ops); goto out; From b5e837c86041bef60f36cf9f20a641a30764379a Mon Sep 17 00:00:00 2001 From: Kuniyuki Iwashima Date: Tue, 8 Oct 2024 11:47:37 -0700 Subject: [PATCH 81/87] phonet: Handle error of rtnl_register_module(). MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Before commit addf9b90de22 ("net: rtnetlink: use rcu to free rtnl message handlers"), once the first rtnl_register_module() allocated rtnl_msg_handlers[PF_PHONET], the following calls never failed. However, after the commit, rtnl_register_module() could fail silently to allocate rtnl_msg_handlers[PF_PHONET][msgtype] and requires error handling for each call. Handling the error allows users to view a module as an all-or-nothing thing in terms of the rtnetlink functionality. This prevents syzkaller from reporting spurious errors from its tests, where OOM often occurs and module is automatically loaded. Let's use rtnl_register_many() to handle the errors easily. Fixes: addf9b90de22 ("net: rtnetlink: use rcu to free rtnl message handlers") Signed-off-by: Kuniyuki Iwashima Acked-by: Rémi Denis-Courmont Signed-off-by: Paolo Abeni --- net/phonet/pn_netlink.c | 28 +++++++++++----------------- 1 file changed, 11 insertions(+), 17 deletions(-) diff --git a/net/phonet/pn_netlink.c b/net/phonet/pn_netlink.c index 7008d402499d5b..894e5c72d6bfff 100644 --- a/net/phonet/pn_netlink.c +++ b/net/phonet/pn_netlink.c @@ -285,23 +285,17 @@ static int route_dumpit(struct sk_buff *skb, struct netlink_callback *cb) return err; } +static const struct rtnl_msg_handler phonet_rtnl_msg_handlers[] __initdata_or_module = { + {THIS_MODULE, PF_PHONET, RTM_NEWADDR, addr_doit, NULL, 0}, + {THIS_MODULE, PF_PHONET, RTM_DELADDR, addr_doit, NULL, 0}, + {THIS_MODULE, PF_PHONET, RTM_GETADDR, NULL, getaddr_dumpit, 0}, + {THIS_MODULE, PF_PHONET, RTM_NEWROUTE, route_doit, NULL, 0}, + {THIS_MODULE, PF_PHONET, RTM_DELROUTE, route_doit, NULL, 0}, + {THIS_MODULE, PF_PHONET, RTM_GETROUTE, NULL, route_dumpit, + RTNL_FLAG_DUMP_UNLOCKED}, +}; + int __init phonet_netlink_register(void) { - int err = rtnl_register_module(THIS_MODULE, PF_PHONET, RTM_NEWADDR, - addr_doit, NULL, 0); - if (err) - return err; - - /* Further rtnl_register_module() cannot fail */ - rtnl_register_module(THIS_MODULE, PF_PHONET, RTM_DELADDR, - addr_doit, NULL, 0); - rtnl_register_module(THIS_MODULE, PF_PHONET, RTM_GETADDR, - NULL, getaddr_dumpit, 0); - rtnl_register_module(THIS_MODULE, PF_PHONET, RTM_NEWROUTE, - route_doit, NULL, 0); - rtnl_register_module(THIS_MODULE, PF_PHONET, RTM_DELROUTE, - route_doit, NULL, 0); - rtnl_register_module(THIS_MODULE, PF_PHONET, RTM_GETROUTE, - NULL, route_dumpit, RTNL_FLAG_DUMP_UNLOCKED); - return 0; + return rtnl_register_many(phonet_rtnl_msg_handlers); } From aeb218d900e3ea2cc3878ba92cb4758227075358 Mon Sep 17 00:00:00 2001 From: Simon Horman Date: Wed, 9 Oct 2024 10:12:19 +0100 Subject: [PATCH 82/87] docs: netdev: document guidance on cleanup patches The purpose of this section is to document what is the current practice regarding clean-up patches which address checkpatch warnings and similar problems. I feel there is a value in having this documented so others can easily refer to it. Clearly this topic is subjective. And to some extent the current practice discourages a wider range of patches than is described here. But I feel it is best to start somewhere, with the most well established part of the current practice. Signed-off-by: Simon Horman Link: https://patch.msgid.link/20241009-doc-mc-clean-v2-1-e637b665fa81@kernel.org Signed-off-by: Jakub Kicinski --- Documentation/process/maintainer-netdev.rst | 17 +++++++++++++++++ 1 file changed, 17 insertions(+) diff --git a/Documentation/process/maintainer-netdev.rst b/Documentation/process/maintainer-netdev.rst index c9edf9e7362d67..1ae71e31591cb4 100644 --- a/Documentation/process/maintainer-netdev.rst +++ b/Documentation/process/maintainer-netdev.rst @@ -355,6 +355,8 @@ just do it. As a result, a sequence of smaller series gets merged quicker and with better review coverage. Re-posting large series also increases the mailing list traffic. +.. _rcs: + Local variable ordering ("reverse xmas tree", "RCS") ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -391,6 +393,21 @@ APIs and helpers, especially scoped iterators. However, direct use of ``__free()`` within networking core and drivers is discouraged. Similar guidance applies to declaring variables mid-function. +Clean-up patches +~~~~~~~~~~~~~~~~ + +Netdev discourages patches which perform simple clean-ups, which are not in +the context of other work. For example: + +* Addressing ``checkpatch.pl`` warnings +* Addressing :ref:`Local variable ordering` issues +* Conversions to device-managed APIs (``devm_`` helpers) + +This is because it is felt that the churn that such changes produce comes +at a greater cost than the value of such clean-ups. + +Conversely, spelling and grammar fixes are not discouraged. + Resending after review ~~~~~~~~~~~~~~~~~~~~~~ From 40dddd4b8bd08a69471efd96107a4e1c73fabefc Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Wed, 9 Oct 2024 18:58:02 +0000 Subject: [PATCH 83/87] ppp: fix ppp_async_encode() illegal access syzbot reported an issue in ppp_async_encode() [1] In this case, pppoe_sendmsg() is called with a zero size. Then ppp_async_encode() is called with an empty skb. BUG: KMSAN: uninit-value in ppp_async_encode drivers/net/ppp/ppp_async.c:545 [inline] BUG: KMSAN: uninit-value in ppp_async_push+0xb4f/0x2660 drivers/net/ppp/ppp_async.c:675 ppp_async_encode drivers/net/ppp/ppp_async.c:545 [inline] ppp_async_push+0xb4f/0x2660 drivers/net/ppp/ppp_async.c:675 ppp_async_send+0x130/0x1b0 drivers/net/ppp/ppp_async.c:634 ppp_channel_bridge_input drivers/net/ppp/ppp_generic.c:2280 [inline] ppp_input+0x1f1/0xe60 drivers/net/ppp/ppp_generic.c:2304 pppoe_rcv_core+0x1d3/0x720 drivers/net/ppp/pppoe.c:379 sk_backlog_rcv+0x13b/0x420 include/net/sock.h:1113 __release_sock+0x1da/0x330 net/core/sock.c:3072 release_sock+0x6b/0x250 net/core/sock.c:3626 pppoe_sendmsg+0x2b8/0xb90 drivers/net/ppp/pppoe.c:903 sock_sendmsg_nosec net/socket.c:729 [inline] __sock_sendmsg+0x30f/0x380 net/socket.c:744 ____sys_sendmsg+0x903/0xb60 net/socket.c:2602 ___sys_sendmsg+0x28d/0x3c0 net/socket.c:2656 __sys_sendmmsg+0x3c1/0x960 net/socket.c:2742 __do_sys_sendmmsg net/socket.c:2771 [inline] __se_sys_sendmmsg net/socket.c:2768 [inline] __x64_sys_sendmmsg+0xbc/0x120 net/socket.c:2768 x64_sys_call+0xb6e/0x3ba0 arch/x86/include/generated/asm/syscalls_64.h:308 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xcd/0x1e0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f Uninit was created at: slab_post_alloc_hook mm/slub.c:4092 [inline] slab_alloc_node mm/slub.c:4135 [inline] kmem_cache_alloc_node_noprof+0x6bf/0xb80 mm/slub.c:4187 kmalloc_reserve+0x13d/0x4a0 net/core/skbuff.c:587 __alloc_skb+0x363/0x7b0 net/core/skbuff.c:678 alloc_skb include/linux/skbuff.h:1322 [inline] sock_wmalloc+0xfe/0x1a0 net/core/sock.c:2732 pppoe_sendmsg+0x3a7/0xb90 drivers/net/ppp/pppoe.c:867 sock_sendmsg_nosec net/socket.c:729 [inline] __sock_sendmsg+0x30f/0x380 net/socket.c:744 ____sys_sendmsg+0x903/0xb60 net/socket.c:2602 ___sys_sendmsg+0x28d/0x3c0 net/socket.c:2656 __sys_sendmmsg+0x3c1/0x960 net/socket.c:2742 __do_sys_sendmmsg net/socket.c:2771 [inline] __se_sys_sendmmsg net/socket.c:2768 [inline] __x64_sys_sendmmsg+0xbc/0x120 net/socket.c:2768 x64_sys_call+0xb6e/0x3ba0 arch/x86/include/generated/asm/syscalls_64.h:308 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xcd/0x1e0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f CPU: 1 UID: 0 PID: 5411 Comm: syz.1.14 Not tainted 6.12.0-rc1-syzkaller-00165-g360c1f1f24c6 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024 Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: syzbot+1d121645899e7692f92a@syzkaller.appspotmail.com Signed-off-by: Eric Dumazet Reviewed-by: Simon Horman Link: https://patch.msgid.link/20241009185802.3763282-1-edumazet@google.com Signed-off-by: Jakub Kicinski --- drivers/net/ppp/ppp_async.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ppp/ppp_async.c b/drivers/net/ppp/ppp_async.c index a940b9a67107a9..c97406c6004d42 100644 --- a/drivers/net/ppp/ppp_async.c +++ b/drivers/net/ppp/ppp_async.c @@ -542,7 +542,7 @@ ppp_async_encode(struct asyncppp *ap) * and 7 (code-reject) must be sent as though no options * had been negotiated. */ - islcp = proto == PPP_LCP && 1 <= data[2] && data[2] <= 7; + islcp = proto == PPP_LCP && count >= 3 && 1 <= data[2] && data[2] <= 7; if (i == 0) { if (islcp) From 6fd27ea183c208e478129a85e11d880fc70040f2 Mon Sep 17 00:00:00 2001 From: "D. Wythe" Date: Wed, 9 Oct 2024 14:55:16 +0800 Subject: [PATCH 84/87] net/smc: fix lacks of icsk_syn_mss with IPPROTO_SMC Eric report a panic on IPPROTO_SMC, and give the facts that when INET_PROTOSW_ICSK was set, icsk->icsk_sync_mss must be set too. Bug: Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000 Mem abort info: ESR = 0x0000000086000005 EC = 0x21: IABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 FSC = 0x05: level 1 translation fault user pgtable: 4k pages, 48-bit VAs, pgdp=00000001195d1000 [0000000000000000] pgd=0800000109c46003, p4d=0800000109c46003, pud=0000000000000000 Internal error: Oops: 0000000086000005 [#1] PREEMPT SMP Modules linked in: CPU: 1 UID: 0 PID: 8037 Comm: syz.3.265 Not tainted 6.11.0-rc7-syzkaller-g5f5673607153 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/06/2024 pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : 0x0 lr : cipso_v4_sock_setattr+0x2a8/0x3c0 net/ipv4/cipso_ipv4.c:1910 sp : ffff80009b887a90 x29: ffff80009b887aa0 x28: ffff80008db94050 x27: 0000000000000000 x26: 1fffe0001aa6f5b3 x25: dfff800000000000 x24: ffff0000db75da00 x23: 0000000000000000 x22: ffff0000d8b78518 x21: 0000000000000000 x20: ffff0000d537ad80 x19: ffff0000d8b78000 x18: 1fffe000366d79ee x17: ffff8000800614a8 x16: ffff800080569b84 x15: 0000000000000001 x14: 000000008b336894 x13: 00000000cd96feaa x12: 0000000000000003 x11: 0000000000040000 x10: 00000000000020a3 x9 : 1fffe0001b16f0f1 x8 : 0000000000000000 x7 : 0000000000000000 x6 : 000000000000003f x5 : 0000000000000040 x4 : 0000000000000001 x3 : 0000000000000000 x2 : 0000000000000002 x1 : 0000000000000000 x0 : ffff0000d8b78000 Call trace: 0x0 netlbl_sock_setattr+0x2e4/0x338 net/netlabel/netlabel_kapi.c:1000 smack_netlbl_add+0xa4/0x154 security/smack/smack_lsm.c:2593 smack_socket_post_create+0xa8/0x14c security/smack/smack_lsm.c:2973 security_socket_post_create+0x94/0xd4 security/security.c:4425 __sock_create+0x4c8/0x884 net/socket.c:1587 sock_create net/socket.c:1622 [inline] __sys_socket_create net/socket.c:1659 [inline] __sys_socket+0x134/0x340 net/socket.c:1706 __do_sys_socket net/socket.c:1720 [inline] __se_sys_socket net/socket.c:1718 [inline] __arm64_sys_socket+0x7c/0x94 net/socket.c:1718 __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline] invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:49 el0_svc_common+0x130/0x23c arch/arm64/kernel/syscall.c:132 do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:151 el0_svc+0x54/0x168 arch/arm64/kernel/entry-common.c:712 el0t_64_sync_handler+0x84/0xfc arch/arm64/kernel/entry-common.c:730 el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:598 Code: ???????? ???????? ???????? ???????? (????????) ---[ end trace 0000000000000000 ]--- This patch add a toy implementation that performs a simple return to prevent such panic. This is because MSS can be set in sock_create_kern or smc_setsockopt, similar to how it's done in AF_SMC. However, for AF_SMC, there is currently no way to synchronize MSS within __sys_connect_file. This toy implementation lays the groundwork for us to support such feature for IPPROTO_SMC in the future. Fixes: d25a92ccae6b ("net/smc: Introduce IPPROTO_SMC") Reported-by: Eric Dumazet Signed-off-by: D. Wythe Reviewed-by: Eric Dumazet Reviewed-by: Wenjia Zhang Link: https://patch.msgid.link/1728456916-67035-1-git-send-email-alibuda@linux.alibaba.com Signed-off-by: Jakub Kicinski --- net/smc/smc_inet.c | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/net/smc/smc_inet.c b/net/smc/smc_inet.c index a5b2041600f959..a944e7dcb8b967 100644 --- a/net/smc/smc_inet.c +++ b/net/smc/smc_inet.c @@ -108,12 +108,23 @@ static struct inet_protosw smc_inet6_protosw = { }; #endif /* CONFIG_IPV6 */ +static unsigned int smc_sync_mss(struct sock *sk, u32 pmtu) +{ + /* No need pass it through to clcsock, mss can always be set by + * sock_create_kern or smc_setsockopt. + */ + return 0; +} + static int smc_inet_init_sock(struct sock *sk) { struct net *net = sock_net(sk); /* init common smc sock */ smc_sk_init(net, sk, IPPROTO_SMC); + + inet_csk(sk)->icsk_sync_mss = smc_sync_mss; + /* create clcsock */ return smc_create_clcsk(net, sk, sk->sk_family); } From 7d3fce8cbe3a70a1c7c06c9b53696be5d5d8dd5c Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Wed, 9 Oct 2024 09:11:32 +0000 Subject: [PATCH 85/87] slip: make slhc_remember() more robust against malicious packets syzbot found that slhc_remember() was missing checks against malicious packets [1]. slhc_remember() only checked the size of the packet was at least 20, which is not good enough. We need to make sure the packet includes the IPv4 and TCP header that are supposed to be carried. Add iph and th pointers to make the code more readable. [1] BUG: KMSAN: uninit-value in slhc_remember+0x2e8/0x7b0 drivers/net/slip/slhc.c:666 slhc_remember+0x2e8/0x7b0 drivers/net/slip/slhc.c:666 ppp_receive_nonmp_frame+0xe45/0x35e0 drivers/net/ppp/ppp_generic.c:2455 ppp_receive_frame drivers/net/ppp/ppp_generic.c:2372 [inline] ppp_do_recv+0x65f/0x40d0 drivers/net/ppp/ppp_generic.c:2212 ppp_input+0x7dc/0xe60 drivers/net/ppp/ppp_generic.c:2327 pppoe_rcv_core+0x1d3/0x720 drivers/net/ppp/pppoe.c:379 sk_backlog_rcv+0x13b/0x420 include/net/sock.h:1113 __release_sock+0x1da/0x330 net/core/sock.c:3072 release_sock+0x6b/0x250 net/core/sock.c:3626 pppoe_sendmsg+0x2b8/0xb90 drivers/net/ppp/pppoe.c:903 sock_sendmsg_nosec net/socket.c:729 [inline] __sock_sendmsg+0x30f/0x380 net/socket.c:744 ____sys_sendmsg+0x903/0xb60 net/socket.c:2602 ___sys_sendmsg+0x28d/0x3c0 net/socket.c:2656 __sys_sendmmsg+0x3c1/0x960 net/socket.c:2742 __do_sys_sendmmsg net/socket.c:2771 [inline] __se_sys_sendmmsg net/socket.c:2768 [inline] __x64_sys_sendmmsg+0xbc/0x120 net/socket.c:2768 x64_sys_call+0xb6e/0x3ba0 arch/x86/include/generated/asm/syscalls_64.h:308 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xcd/0x1e0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f Uninit was created at: slab_post_alloc_hook mm/slub.c:4091 [inline] slab_alloc_node mm/slub.c:4134 [inline] kmem_cache_alloc_node_noprof+0x6bf/0xb80 mm/slub.c:4186 kmalloc_reserve+0x13d/0x4a0 net/core/skbuff.c:587 __alloc_skb+0x363/0x7b0 net/core/skbuff.c:678 alloc_skb include/linux/skbuff.h:1322 [inline] sock_wmalloc+0xfe/0x1a0 net/core/sock.c:2732 pppoe_sendmsg+0x3a7/0xb90 drivers/net/ppp/pppoe.c:867 sock_sendmsg_nosec net/socket.c:729 [inline] __sock_sendmsg+0x30f/0x380 net/socket.c:744 ____sys_sendmsg+0x903/0xb60 net/socket.c:2602 ___sys_sendmsg+0x28d/0x3c0 net/socket.c:2656 __sys_sendmmsg+0x3c1/0x960 net/socket.c:2742 __do_sys_sendmmsg net/socket.c:2771 [inline] __se_sys_sendmmsg net/socket.c:2768 [inline] __x64_sys_sendmmsg+0xbc/0x120 net/socket.c:2768 x64_sys_call+0xb6e/0x3ba0 arch/x86/include/generated/asm/syscalls_64.h:308 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xcd/0x1e0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f CPU: 0 UID: 0 PID: 5460 Comm: syz.2.33 Not tainted 6.12.0-rc2-syzkaller-00006-g87d6aab2389e #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024 Fixes: b5451d783ade ("slip: Move the SLIP drivers") Reported-by: syzbot+2ada1bc857496353be5a@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/670646db.050a0220.3f80e.0027.GAE@google.com/T/#u Signed-off-by: Eric Dumazet Link: https://patch.msgid.link/20241009091132.2136321-1-edumazet@google.com Signed-off-by: Jakub Kicinski --- drivers/net/slip/slhc.c | 57 ++++++++++++++++++++++++----------------- 1 file changed, 34 insertions(+), 23 deletions(-) diff --git a/drivers/net/slip/slhc.c b/drivers/net/slip/slhc.c index 252cd757d3a2e5..ee9fd3a94b96fe 100644 --- a/drivers/net/slip/slhc.c +++ b/drivers/net/slip/slhc.c @@ -643,46 +643,57 @@ slhc_uncompress(struct slcompress *comp, unsigned char *icp, int isize) int slhc_remember(struct slcompress *comp, unsigned char *icp, int isize) { - struct cstate *cs; - unsigned ihl; - + const struct tcphdr *th; unsigned char index; + struct iphdr *iph; + struct cstate *cs; + unsigned int ihl; - if(isize < 20) { - /* The packet is shorter than a legal IP header */ + /* The packet is shorter than a legal IP header. + * Also make sure isize is positive. + */ + if (isize < (int)sizeof(struct iphdr)) { +runt: comp->sls_i_runt++; - return slhc_toss( comp ); + return slhc_toss(comp); } + iph = (struct iphdr *)icp; /* Peek at the IP header's IHL field to find its length */ - ihl = icp[0] & 0xf; - if(ihl < 20 / 4){ - /* The IP header length field is too small */ - comp->sls_i_runt++; - return slhc_toss( comp ); - } - index = icp[9]; - icp[9] = IPPROTO_TCP; + ihl = iph->ihl; + /* The IP header length field is too small, + * or packet is shorter than the IP header followed + * by minimal tcp header. + */ + if (ihl < 5 || isize < ihl * 4 + sizeof(struct tcphdr)) + goto runt; + + index = iph->protocol; + iph->protocol = IPPROTO_TCP; if (ip_fast_csum(icp, ihl)) { /* Bad IP header checksum; discard */ comp->sls_i_badcheck++; - return slhc_toss( comp ); + return slhc_toss(comp); } - if(index > comp->rslot_limit) { + if (index > comp->rslot_limit) { comp->sls_i_error++; return slhc_toss(comp); } - + th = (struct tcphdr *)(icp + ihl * 4); + if (th->doff < sizeof(struct tcphdr) / 4) + goto runt; + if (isize < ihl * 4 + th->doff * 4) + goto runt; /* Update local state */ cs = &comp->rstate[comp->recv_current = index]; comp->flags &=~ SLF_TOSS; - memcpy(&cs->cs_ip,icp,20); - memcpy(&cs->cs_tcp,icp + ihl*4,20); + memcpy(&cs->cs_ip, iph, sizeof(*iph)); + memcpy(&cs->cs_tcp, th, sizeof(*th)); if (ihl > 5) - memcpy(cs->cs_ipopt, icp + sizeof(struct iphdr), (ihl - 5) * 4); - if (cs->cs_tcp.doff > 5) - memcpy(cs->cs_tcpopt, icp + ihl*4 + sizeof(struct tcphdr), (cs->cs_tcp.doff - 5) * 4); - cs->cs_hsize = ihl*2 + cs->cs_tcp.doff*2; + memcpy(cs->cs_ipopt, &iph[1], (ihl - 5) * 4); + if (th->doff > 5) + memcpy(cs->cs_tcpopt, &th[1], (th->doff - 5) * 4); + cs->cs_hsize = ihl*2 + th->doff*2; cs->initialized = true; /* Put headers back on packet * Neither header checksum is recalculated From 9937aae39bc09645cd67d53e0320926cd91570de Mon Sep 17 00:00:00 2001 From: Simon Horman Date: Wed, 9 Oct 2024 09:47:22 +0100 Subject: [PATCH 86/87] MAINTAINERS: consistently exclude wireless files from NETWORKING [GENERAL] We already exclude wireless drivers from the netdev@ traffic, to delegate it to linux-wireless@, and avoid overwhelming netdev@. Many of the following wireless-related sections MAINTAINERS are already not included in the NETWORKING [GENERAL] section. For consistency, exclude those that are. * 802.11 (including CFG80211/NL80211) * MAC80211 * RFKILL Acked-by: Johannes Berg Signed-off-by: Simon Horman Link: https://patch.msgid.link/20241009-maint-net-hdrs-v2-1-f2c86e7309c8@kernel.org Signed-off-by: Jakub Kicinski --- MAINTAINERS | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/MAINTAINERS b/MAINTAINERS index 58380aeafbf0da..89789bddfb6aab 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -16197,8 +16197,19 @@ F: lib/random32.c F: net/ F: tools/net/ F: tools/testing/selftests/net/ +X: Documentation/networking/mac80211-injection.rst +X: Documentation/networking/mac80211_hwsim/ +X: Documentation/networking/regulatory.rst +X: include/net/cfg80211.h +X: include/net/ieee80211_radiotap.h +X: include/net/iw_handler.h +X: include/net/mac80211.h +X: include/net/wext.h X: net/9p/ X: net/bluetooth/ +X: net/mac80211/ +X: net/rfkill/ +X: net/wireless/ NETWORKING [IPSEC] M: Steffen Klassert From 5404b5a2fea9831a1f5be4ab9a94de07d976b177 Mon Sep 17 00:00:00 2001 From: Simon Horman Date: Wed, 9 Oct 2024 09:47:23 +0100 Subject: [PATCH 87/87] MAINTAINERS: Add headers and mailing list to UDP section Add netdev mailing list and some more udp.h headers to the UDP section. This is now more consistent with the TCP section. Acked-by: Willem de Bruijn Signed-off-by: Simon Horman Link: https://patch.msgid.link/20241009-maint-net-hdrs-v2-2-f2c86e7309c8@kernel.org Signed-off-by: Jakub Kicinski --- MAINTAINERS | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/MAINTAINERS b/MAINTAINERS index 89789bddfb6aab..baf2ba9dc0707b 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -24184,8 +24184,12 @@ F: drivers/usb/host/xhci* USER DATAGRAM PROTOCOL (UDP) M: Willem de Bruijn +L: netdev@vger.kernel.org S: Maintained F: include/linux/udp.h +F: include/net/udp.h +F: include/trace/events/udp.h +F: include/uapi/linux/udp.h F: net/ipv4/udp.c F: net/ipv6/udp.c