REMOVE_ADDR support #50

Closed
matttbe opened this issue Jul 13, 2020 · 3 comments

matttbe commented Jul 13, 2020

As defined in RFC8684:

If, during the lifetime of an MPTCP connection, a previously announced address becomes invalid (e.g., if the interface disappears or an IPv6 address is no longer preferred), the affected host SHOULD announce this situation so that the peer can remove subflows related to this address. Even if an address is not in use by an MPTCP connection, if it has been previously announced, an implementation SHOULD announce its removal. A host MAY also choose to announce that a valid IP address should not be used any longer -- for example, for make-before-break session continuity.
This is achieved through the Remove Address (REMOVE_ADDR) option (Figure 13), which will remove a previously added address (or list of addresses) from a connection and terminate any subflows currently using that address.

                     1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+---------------+---------------+-------+-------+---------------+
|     Kind      |Length = 3 + n |Subtype|(resvd)|   Address ID  | ...
+---------------+---------------+-------+-------+---------------+
                           (followed by n-1 Address IDs, if required)

Figure 13: Remove Address (REMOVE_ADDR) Option

For security purposes, if a host receives a REMOVE_ADDR option, it must ensure that the affected path or paths are no longer in use before it instigates closure. The receipt of REMOVE_ADDR SHOULD first trigger the sending of a TCP keepalive [RFC1122] on the path, and if a response is received, the path SHOULD NOT be removed. If the path is found to still be alive, the receiving host SHOULD no longer use the specified address for future connections, but it is the responsibility of the host that sent the REMOVE_ADDR to shut down the subflow. Before the address is removed, the requesting host MAY also use MP_PRIO (Section 3.3.8) to request that a path no longer be used. Typical TCP validity tests on the subflow (e.g., ensuring that sequence and ACK numbers are correct) MUST also be undertaken. An implementation can use indications of these test failures as part of intrusion detection or error logging.
The sending and receipt (if no keepalive response was received) of this message SHOULD trigger the sending of RSTs by both hosts on the affected subflow(s) (if possible), as a courtesy, to allow the cleanup of middlebox state before cleaning up any local state.
Address removal is undertaken according to the Address ID, so as to permit the use of NATs and other middleboxes that rewrite source addresses. If an Address ID is not known, the receiver will silently ignore the request.
A subflow that is still functioning MUST be closed with a FIN exchange as in regular TCP, rather than using this option. For more information, see Section 3.3.3.

https://www.rfc-editor.org/rfc/rfc8684.html#name-remove-address
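
For illustration only, here is a minimal C sketch of the wire layout from Figure 13; the struct and field names are made up for this sketch and are not the ones used in the kernel implementation:

        #include <stdint.h>

        /*
         * Hypothetical view of the REMOVE_ADDR option from Figure 13 (RFC 8684).
         * Kind/Length form the standard TCP option header; the MPTCP subtype
         * (0x4 for REMOVE_ADDR) shares a byte with 4 reserved bits.
         */
        struct mptcp_remove_addr_opt {
                uint8_t kind;           /* TCP option kind (30 for MPTCP) */
                uint8_t length;         /* 3 + n, where n = number of Address IDs */
                uint8_t subtype_rsvd;   /* upper 4 bits: subtype, lower 4: reserved */
                uint8_t address_ids[];  /* the n Address IDs to remove */
        };

Per the figure, a receiver can derive the number of Address IDs carried as length - 3.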

If needed, new tickets can be created to track the receiver side and the sender side separately, as well as packetdrill support.

(Feature from the initial roadmap)


matttbe commented Jul 17, 2020

@geliangtang is looking at this


matttbe commented Aug 6, 2020

@geliangtang: maybe something that would be really good to have with the patches you sent is tests, either with kselftests or packetdrill. Packetdrill probably makes more sense, but already having some tests validating this and checking that we don't introduce regressions would be really nice, if you have the opportunity to work on that :-)


matttbe commented Sep 30, 2020

Implemented by @geliangtang and now on net-next, thx! → a1a3552
This ticket can then be closed!

matttbe closed this as completed Sep 30, 2020
jenkins-tessares pushed a commit that referenced this issue Jan 21, 2021
…tirq

Commit 39d42fa ("dm crypt: add flags to optionally bypass kcryptd
workqueues") made it possible for some code paths in dm-crypt to be
executed in softirq context, when the underlying driver processes IO
requests in interrupt/softirq context.

When Crypto API backlogs a crypto request, dm-crypt uses
wait_for_completion to avoid sending further requests to an already
overloaded crypto driver. However, if the code is executing in softirq
context, we might get the following stacktrace:

[  210.235213][    C0] BUG: scheduling while atomic: fio/2602/0x00000102
[  210.236701][    C0] Modules linked in:
[  210.237566][    C0] CPU: 0 PID: 2602 Comm: fio Tainted: G        W         5.10.0+ #50
[  210.239292][    C0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
[  210.241233][    C0] Call Trace:
[  210.241946][    C0]  <IRQ>
[  210.242561][    C0]  dump_stack+0x7d/0xa3
[  210.243466][    C0]  __schedule_bug.cold+0xb3/0xc2
[  210.244539][    C0]  __schedule+0x156f/0x20d0
[  210.245518][    C0]  ? io_schedule_timeout+0x140/0x140
[  210.246660][    C0]  schedule+0xd0/0x270
[  210.247541][    C0]  schedule_timeout+0x1fb/0x280
[  210.248586][    C0]  ? usleep_range+0x150/0x150
[  210.249624][    C0]  ? unpoison_range+0x3a/0x60
[  210.250632][    C0]  ? ____kasan_kmalloc.constprop.0+0x82/0xa0
[  210.251949][    C0]  ? unpoison_range+0x3a/0x60
[  210.252958][    C0]  ? __prepare_to_swait+0xa7/0x190
[  210.254067][    C0]  do_wait_for_common+0x2ab/0x370
[  210.255158][    C0]  ? usleep_range+0x150/0x150
[  210.256192][    C0]  ? bit_wait_io_timeout+0x160/0x160
[  210.257358][    C0]  ? blk_update_request+0x757/0x1150
[  210.258582][    C0]  ? _raw_spin_lock_irq+0x82/0xd0
[  210.259674][    C0]  ? _raw_read_unlock_irqrestore+0x30/0x30
[  210.260917][    C0]  wait_for_completion+0x4c/0x90
[  210.261971][    C0]  crypt_convert+0x19a6/0x4c00
[  210.263033][    C0]  ? _raw_spin_lock_irqsave+0x87/0xe0
[  210.264193][    C0]  ? kasan_set_track+0x1c/0x30
[  210.265191][    C0]  ? crypt_iv_tcw_ctr+0x4a0/0x4a0
[  210.266283][    C0]  ? kmem_cache_free+0x104/0x470
[  210.267363][    C0]  ? crypt_endio+0x91/0x180
[  210.268327][    C0]  kcryptd_crypt_read_convert+0x30e/0x420
[  210.269565][    C0]  blk_update_request+0x757/0x1150
[  210.270563][    C0]  blk_mq_end_request+0x4b/0x480
[  210.271680][    C0]  blk_done_softirq+0x21d/0x340
[  210.272775][    C0]  ? _raw_spin_lock+0x81/0xd0
[  210.273847][    C0]  ? blk_mq_stop_hw_queue+0x30/0x30
[  210.275031][    C0]  ? _raw_read_lock_irq+0x40/0x40
[  210.276182][    C0]  __do_softirq+0x190/0x611
[  210.277203][    C0]  ? handle_edge_irq+0x221/0xb60
[  210.278340][    C0]  asm_call_irq_on_stack+0x12/0x20
[  210.279514][    C0]  </IRQ>
[  210.280164][    C0]  do_softirq_own_stack+0x37/0x40
[  210.281281][    C0]  irq_exit_rcu+0x110/0x1b0
[  210.282286][    C0]  common_interrupt+0x74/0x120
[  210.283376][    C0]  asm_common_interrupt+0x1e/0x40
[  210.284496][    C0] RIP: 0010:_aesni_enc1+0x65/0xb0

Fix this by making the crypt_convert function reentrant from the point of
view of a single bio, and by making dm-crypt defer further bio processing
to a workqueue if the Crypto API backlogs a request in interrupt context.
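
For illustration, a rough sketch of the defer-to-workqueue pattern described above, with placeholder names (not the actual dm-crypt code):

        #include <linux/preempt.h>
        #include <linux/workqueue.h>

        /* Hypothetical per-IO context; dm-crypt's real structures differ. */
        struct my_io {
                struct work_struct work;
                /* ... bio, crypto state, ... */
        };

        static void my_io_process(struct my_io *io);    /* may sleep */

        static void my_io_workfn(struct work_struct *work)
        {
                struct my_io *io = container_of(work, struct my_io, work);

                my_io_process(io);      /* now safely in process context */
        }

        static void my_io_kick(struct my_io *io, struct workqueue_struct *wq)
        {
                if (in_task()) {
                        my_io_process(io);      /* process context: blocking is fine */
                        return;
                }

                /* Softirq/interrupt context: defer instead of sleeping in
                 * wait_for_completion(). */
                INIT_WORK(&io->work, my_io_workfn);
                queue_work(wq, &io->work);
        }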

Fixes: 39d42fa ("dm crypt: add flags to optionally bypass kcryptd workqueues")
Cc: [email protected] # v5.9+
Signed-off-by: Ignat Korchagin <[email protected]>
Acked-by: Mikulas Patocka <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
jenkins-tessares pushed a commit that referenced this issue Jan 21, 2021
Commit 39d42fa ("dm crypt: add flags to optionally bypass kcryptd
workqueues") made it possible for some code paths in dm-crypt to be
executed in softirq context, when the underlying driver processes IO
requests in interrupt/softirq context.

In this case, when allocating a new crypto request, we may sometimes get a
stacktrace like the one below:

[  210.103008][    C0] BUG: sleeping function called from invalid context at mm/mempool.c:381
[  210.104746][    C0] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 2602, name: fio
[  210.106599][    C0] CPU: 0 PID: 2602 Comm: fio Tainted: G        W         5.10.0+ #50
[  210.108331][    C0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
[  210.110212][    C0] Call Trace:
[  210.110921][    C0]  <IRQ>
[  210.111527][    C0]  dump_stack+0x7d/0xa3
[  210.112411][    C0]  ___might_sleep.cold+0x122/0x151
[  210.113527][    C0]  mempool_alloc+0x16b/0x2f0
[  210.114524][    C0]  ? __queue_work+0x515/0xde0
[  210.115553][    C0]  ? mempool_resize+0x700/0x700
[  210.116586][    C0]  ? crypt_endio+0x91/0x180
[  210.117479][    C0]  ? blk_update_request+0x757/0x1150
[  210.118513][    C0]  ? blk_mq_end_request+0x4b/0x480
[  210.119572][    C0]  ? blk_done_softirq+0x21d/0x340
[  210.120628][    C0]  ? __do_softirq+0x190/0x611
[  210.121626][    C0]  crypt_convert+0x29f9/0x4c00
[  210.122668][    C0]  ? _raw_spin_lock_irqsave+0x87/0xe0
[  210.123824][    C0]  ? kasan_set_track+0x1c/0x30
[  210.124858][    C0]  ? crypt_iv_tcw_ctr+0x4a0/0x4a0
[  210.125930][    C0]  ? kmem_cache_free+0x104/0x470
[  210.126973][    C0]  ? crypt_endio+0x91/0x180
[  210.127947][    C0]  kcryptd_crypt_read_convert+0x30e/0x420
[  210.129165][    C0]  blk_update_request+0x757/0x1150
[  210.130231][    C0]  blk_mq_end_request+0x4b/0x480
[  210.131294][    C0]  blk_done_softirq+0x21d/0x340
[  210.132332][    C0]  ? _raw_spin_lock+0x81/0xd0
[  210.133289][    C0]  ? blk_mq_stop_hw_queue+0x30/0x30
[  210.134399][    C0]  ? _raw_read_lock_irq+0x40/0x40
[  210.135458][    C0]  __do_softirq+0x190/0x611
[  210.136409][    C0]  ? handle_edge_irq+0x221/0xb60
[  210.137447][    C0]  asm_call_irq_on_stack+0x12/0x20
[  210.138507][    C0]  </IRQ>
[  210.139118][    C0]  do_softirq_own_stack+0x37/0x40
[  210.140191][    C0]  irq_exit_rcu+0x110/0x1b0
[  210.141151][    C0]  common_interrupt+0x74/0x120
[  210.142171][    C0]  asm_common_interrupt+0x1e/0x40

Fix this by allocating crypto requests with the GFP_ATOMIC mask when in
interrupt context.
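
For illustration, a minimal sketch of the context-dependent allocation mask this describes (placeholder function name; not the exact dm-crypt code):

        #include <linux/mempool.h>
        #include <linux/preempt.h>

        /*
         * Pick a GFP mask matching the calling context: GFP_ATOMIC when we
         * cannot sleep (interrupt/softirq), GFP_NOIO otherwise so the
         * allocation does not recurse into the block layer.
         */
        static void *alloc_crypt_req(mempool_t *pool)
        {
                gfp_t gfp = in_interrupt() ? GFP_ATOMIC : GFP_NOIO;

                return mempool_alloc(pool, gfp);
        }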

Fixes: 39d42fa ("dm crypt: add flags to optionally bypass kcryptd workqueues")
Cc: [email protected] # v5.9+
Reported-by: Maciej S. Szmigiero <[email protected]>
Signed-off-by: Ignat Korchagin <[email protected]>
Acked-by: Mikulas Patocka <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
jenkins-tessares pushed a commit that referenced this issue May 28, 2021
An out of bounds write happens when setting the default power state.
KASAN sees this as:

[drm] radeon: 512M of GTT memory ready.
[drm] GART: num cpu pages 131072, num gpu pages 131072
==================================================================
BUG: KASAN: slab-out-of-bounds in
radeon_atombios_parse_power_table_1_3+0x1837/0x1998 [radeon]
Write of size 4 at addr ffff88810178d858 by task systemd-udevd/157

CPU: 0 PID: 157 Comm: systemd-udevd Not tainted 5.12.0-E620 #50
Hardware name: eMachines        eMachines E620  /Nile       , BIOS V1.03 09/30/2008
Call Trace:
 dump_stack+0xa5/0xe6
 print_address_description.constprop.0+0x18/0x239
 kasan_report+0x170/0x1a8
 radeon_atombios_parse_power_table_1_3+0x1837/0x1998 [radeon]
 radeon_atombios_get_power_modes+0x144/0x1888 [radeon]
 radeon_pm_init+0x1019/0x1904 [radeon]
 rs690_init+0x76e/0x84a [radeon]
 radeon_device_init+0x1c1a/0x21e5 [radeon]
 radeon_driver_load_kms+0xf5/0x30b [radeon]
 drm_dev_register+0x255/0x4a0 [drm]
 radeon_pci_probe+0x246/0x2f6 [radeon]
 pci_device_probe+0x1aa/0x294
 really_probe+0x30e/0x850
 driver_probe_device+0xe6/0x135
 device_driver_attach+0xc1/0xf8
 __driver_attach+0x13f/0x146
 bus_for_each_dev+0xfa/0x146
 bus_add_driver+0x2b3/0x447
 driver_register+0x242/0x2c1
 do_one_initcall+0x149/0x2fd
 do_init_module+0x1ae/0x573
 load_module+0x4dee/0x5cca
 __do_sys_finit_module+0xf1/0x140
 do_syscall_64+0x33/0x40
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Without KASAN, this will manifest later when the kernel attempts to
allocate memory that was stomped, since it collides with the inline slab
freelist pointer:

invalid opcode: 0000 [#1] SMP NOPTI
CPU: 0 PID: 781 Comm: openrc-run.sh Tainted: G        W 5.10.12-gentoo-E620 #2
Hardware name: eMachines        eMachines E620  /Nile , BIOS V1.03       09/30/2008
RIP: 0010:kfree+0x115/0x230
Code: 89 c5 e8 75 ea ff ff 48 8b 00 0f ba e0 09 72 63 e8 1f f4 ff ff 41 89 c4 48 8b 45 00 0f ba e0 10 72 0a 48 8b 45 08 a8 01 75 02 <0f> 0b 44 89 e1 48 c7 c2 00 f0 ff ff be 06 00 00 00 48 d3 e2 48 c7
RSP: 0018:ffffb42f40267e10 EFLAGS: 00010246
RAX: ffffd61280ee8d88 RBX: 0000000000000004 RCX: 000000008010000d
RDX: 4000000000000000 RSI: ffffffffba1360b0 RDI: ffffd61280ee8d80
RBP: ffffd61280ee8d80 R08: ffffffffb91bebdf R09: 0000000000000000
R10: ffff8fe2c1047ac8 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000100
FS:  00007fe80eff6b68(0000) GS:ffff8fe339c00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fe80eec7bc0 CR3: 0000000038012000 CR4: 00000000000006f0
Call Trace:
 __free_fdtable+0x16/0x1f
 put_files_struct+0x81/0x9b
 do_exit+0x433/0x94d
 do_group_exit+0xa6/0xa6
 __x64_sys_exit_group+0xf/0xf
 do_syscall_64+0x33/0x40
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7fe80ef64bea
Code: Unable to access opcode bytes at RIP 0x7fe80ef64bc0.
RSP: 002b:00007ffdb1c47528 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fe80ef64bea
RDX: 00007fe80ef64f60 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
R10: 00007fe80ee2c620 R11: 0000000000000246 R12: 00007fe80eff41e0
R13: 00000000ffffffff R14: 0000000000000024 R15: 00007fe80edf9cd0
Modules linked in: radeon(+) ath5k(+) snd_hda_codec_realtek ...

Use a valid power_state index when initializing the "flags", "misc", and
"misc2" fields.
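
A hedged sketch of the kind of bounds check the fix implies; the structures and field names below are illustrative, not the actual radeon ones:

        #include <linux/minmax.h>
        #include <linux/types.h>

        /* Illustrative only; not the real radeon power-table structures. */
        struct pm_state {
                u32 flags;
                u32 misc;
                u32 misc2;
        };

        struct pm_table {
                int num_states;
                struct pm_state *states;
        };

        static void init_default_state(struct pm_table *pm, int requested_idx)
        {
                /* Clamp to the last parsed entry so the writes can never land
                 * past the end of the allocation (the slab-out-of-bounds write
                 * KASAN reported above). */
                int idx = min(requested_idx, pm->num_states - 1);

                pm->states[idx].flags = 0;
                pm->states[idx].misc  = 0;
                pm->states[idx].misc2 = 0;
        }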

Bug: https://bugzilla.kernel.org/show_bug.cgi?id=211537
Reported-by: Erhard F. <[email protected]>
Fixes: a48b9b4 ("drm/radeon/kms/pm: add asic specific callbacks for getting power state (v2)")
Fixes: 79daedc ("drm/radeon/kms: minor pm cleanups")
Signed-off-by: Kees Cook <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
dcaratti pushed a commit to dcaratti/mptcp_net-next that referenced this issue Sep 2, 2021
jenkins-tessares pushed a commit that referenced this issue Dec 24, 2021
Fix suspicious rcu_dereference_protected() usage when offloading a tc action.

We should hold tcfa_lock when offloading a tc action during action initialization.

Without these changes, the following warning will be observed:

WARNING: suspicious RCU usage
5.16.0-rc5-net-next-01504-g7d1f236dcffa-dirty #50 Tainted: G          I
-----------------------------
include/net/tc_act/tc_tunnel_key.h:33 suspicious rcu_dereference_protected() usage!
1 lock held by tc/12108:
CPU: 4 PID: 12108 Comm: tc Tainted: G
Hardware name: Dell Inc. PowerEdge R740/07WCGN, BIOS 1.6.11 11/20/2018
Call Trace:
<TASK>
dump_stack_lvl+0x49/0x5e
dump_stack+0x10/0x12
lockdep_rcu_suspicious+0xed/0xf8
tcf_tunnel_key_offload_act_setup+0x1de/0x300 [act_tunnel_key]
tcf_action_offload_add_ex+0xc0/0x1f0
tcf_action_init+0x26a/0x2f0
tcf_action_add+0xa9/0x1f0
tc_ctl_action+0xfb/0x170
rtnetlink_rcv_msg+0x169/0x510
? sched_clock+0x9/0x10
? rtnl_newlink+0x70/0x70
netlink_rcv_skb+0x55/0x100
rtnetlink_rcv+0x15/0x20
netlink_unicast+0x1a8/0x270
netlink_sendmsg+0x245/0x490
sock_sendmsg+0x65/0x70
____sys_sendmsg+0x219/0x260
? __import_iovec+0x2c/0x150
___sys_sendmsg+0xb7/0x100
? __lock_acquire+0x3d5/0x1f40
? __this_cpu_preempt_check+0x13/0x20
? lock_is_held_type+0xe4/0x140
? sched_clock+0x9/0x10
? ktime_get_coarse_real_ts64+0xbe/0xd0
? __this_cpu_preempt_check+0x13/0x20
? lockdep_hardirqs_on+0x7e/0x100
? ktime_get_coarse_real_ts64+0xbe/0xd0
? trace_hardirqs_on+0x2a/0xf0
__sys_sendmsg+0x5a/0xa0
? syscall_trace_enter.constprop.0+0x1dd/0x220
__x64_sys_sendmsg+0x1f/0x30
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f4db7bb7a60
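
For illustration, a rough sketch of the locking pattern applied by the fix; setup_offload_entry() is a placeholder, not a real function:

        #include <linux/spinlock.h>
        #include <net/act_api.h>

        /*
         * Hold the per-action tcfa_lock around the offload setup so that
         * rcu_dereference_protected() in the ->offload_act_setup() callbacks
         * runs under a lock lockdep can verify.
         */
        static int offload_one_action(struct tc_action *act, void *entry)
        {
                int err;

                spin_lock_bh(&act->tcfa_lock);
                err = setup_offload_entry(act, entry);  /* placeholder helper */
                spin_unlock_bh(&act->tcfa_lock);

                return err;
        }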

Fixes: 8cbfe93 ("flow_offload: allow user to offload tc action to net device")
Reported-by: Eric Dumazet <[email protected]>
Signed-off-by: Baowen Zheng <[email protected]>
Signed-off-by: Louis Peens <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
jenkins-tessares pushed a commit that referenced this issue Mar 29, 2022
Problem:
=======

Userspace might read the zero-page instead of actual data from a direct IO
read on a block device if madvise(MADV_FREE) has been called on the buffers
earlier (this is discussed below), due to a race between page reclaim on
MADV_FREE and blkdev direct IO read.

- Race condition:
  ==============

During page reclaim, the MADV_FREE page check in try_to_unmap_one() checks
if the page is not dirty, then discards its rmap PTE(s) (vs.  remap back
if the page is dirty).

However, after try_to_unmap_one() returns to shrink_page_list(), it might
keep the page _anyway_ if page_ref_freeze() fails (it expects exactly
_one_ page reference, from the isolation for page reclaim).

Well, blkdev_direct_IO() gets references for all pages, and on READ
operations it only sets them dirty _later_.

So, if MADV_FREE'd pages (i.e., not dirty) are used as buffers for direct
IO read from block devices, and page reclaim happens during
__blkdev_direct_IO[_simple]() exactly AFTER bio_iov_iter_get_pages()
returns, but BEFORE the pages are set dirty, the situation happens.

The direct IO read eventually completes.  Now, when userspace reads the
buffers, the PTE is no longer there and the page fault handler
do_anonymous_page() services that with the zero-page, NOT the data!

A synthetic reproducer is provided.

- Page faults:
  ===========

If page reclaim happens BEFORE bio_iov_iter_get_pages() the issue doesn't
happen, because that faults-in all pages as writeable, so
do_anonymous_page() sets up a new page/rmap/PTE, and that is used by
direct IO.  The userspace reads don't fault as the PTE is there (thus
zero-page is not used/setup).

But if page reclaim happens AFTER it / BEFORE setting pages dirty, the PTE
is no longer there; the subsequent page faults can't help:

The data-read from the block device probably won't generate faults due to
DMA (no MMU) but even in the case it wouldn't use DMA, that happens on
different virtual addresses (not user-mapped addresses) because `struct
bio_vec` stores `struct page` to figure addresses out (which are different
from user-mapped addresses) for the read.

Thus userspace reads (to user-mapped addresses) still fault, then
do_anonymous_page() gets another `struct page` that would address/map to
other memory than the `struct page` used by `struct bio_vec` for the read.
(The original `struct page` is not available, since it wasn't freed, as
page_ref_freeze() failed due to more page refs.  And even if it were
available, its data cannot be trusted anymore.)

Solution:
========

One solution is to check for the expected page reference count in
try_to_unmap_one().

There should be one reference from the isolation (that is also checked in
shrink_page_list() with page_ref_freeze()) plus one or more references
from page mapping(s) (put in discard: label).  Further references mean
that rmap/PTE cannot be unmapped/nuked.

(Note: there might be more than one reference from mapping due to
fork()/clone() without CLONE_VM, which use the same `struct page` for
references, until the copy-on-write page gets copied.)

So, additional page references (e.g., from a direct IO read) now prevent the
rmap/PTE from being unmapped/dropped, similarly to how the page is not freed
per shrink_page_list()/page_ref_freeze().
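
For illustration, a simplified fragment of the check described above, written as if inside a try_to_unmap_one()-style PTE walk; the surrounding variables (mm, address, page, pvmw, ret) are assumed from that context, and the real mm code is more involved:

        /* Clear the PTE first, then re-check the refcount. One reference
         * comes from the isolation for reclaim, plus one per remaining
         * mapping; anything extra (e.g. a direct IO read in flight) means
         * the page must not be discarded. */
        pteval = ptep_get_and_clear(mm, address, pvmw.pte);     /* write PTE */

        smp_mb();       /* pair with GUP-fast's refcount-then-PTE ordering */

        ref_count = page_ref_count(page);
        map_count = page_mapcount(page);

        if (ref_count != 1 + map_count || PageDirty(page)) {
                /* Extra references or dirty data: restore the PTE and keep
                 * the page (treat it as swap-backed again). */
                set_pte_at(mm, address, pvmw.pte, pteval);
                ret = false;    /* abort the unmap of this page */
        }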

- Races and Barriers:
  ==================

The new check in try_to_unmap_one() should be safe in races with
bio_iov_iter_get_pages() in get_user_pages() fast and slow paths, as it's
done under the PTE lock.

The fast path doesn't take the lock, but it checks if the PTE has changed
and if so, it drops the reference and leaves the page for the slow path
(which does take that lock).

The fast path requires synchronization w/ full memory barrier: it writes
the page reference count first then it reads the PTE later, while
try_to_unmap() writes PTE first then it reads page refcount.

And a second barrier is needed, as the page dirty flag should not be read
before the page reference count (as in __remove_mapping()).  (This can be
a load memory barrier only; no writes are involved.)

Call stack/comments:

- try_to_unmap_one()
  - page_vma_mapped_walk()
    - map_pte()			# see pte_offset_map_lock():
        pte_offset_map()
        spin_lock()

  - ptep_get_and_clear()	# write PTE
  - smp_mb()			# (new barrier) GUP fast path
  - page_ref_count()		# (new check) read refcount

  - page_vma_mapped_walk_done()	# see pte_unmap_unlock():
      pte_unmap()
      spin_unlock()

- bio_iov_iter_get_pages()
  - __bio_iov_iter_get_pages()
    - iov_iter_get_pages()
      - get_user_pages_fast()
        - internal_get_user_pages_fast()

          # fast path
          - lockless_pages_from_mm()
            - gup_{pgd,p4d,pud,pmd,pte}_range()
                ptep = pte_offset_map()		# not _lock()
                pte = ptep_get_lockless(ptep)

                page = pte_page(pte)
                try_grab_compound_head(page)	# inc refcount
                                            	# (RMW/barrier
                                             	#  on success)

                if (pte_val(pte) != pte_val(*ptep)) # read PTE
                        put_compound_head(page) # dec refcount
                        			# go slow path

          # slow path
          - __gup_longterm_unlocked()
            - get_user_pages_unlocked()
              - __get_user_pages_locked()
                - __get_user_pages()
                  - follow_{page,p4d,pud,pmd}_mask()
                    - follow_page_pte()
                        ptep = pte_offset_map_lock()
                        pte = *ptep
                        page = vm_normal_page(pte)
                        try_grab_page(page)	# inc refcount
                        pte_unmap_unlock()

- Huge Pages:
  ==========

Regarding transparent hugepages, that logic shouldn't change, as MADV_FREE
(aka lazyfree) pages are PageAnon() && !PageSwapBacked()
(madvise_free_pte_range() -> mark_page_lazyfree() -> lru_lazyfree_fn())
thus should reach shrink_page_list() -> split_huge_page_to_list() before
try_to_unmap[_one](), so it deals with normal pages only.

(And in case unlikely/TTU_SPLIT_HUGE_PMD/split_huge_pmd_address() happens,
which should not or be rare, the page refcount should be greater than
mapcount: the head page is referenced by tail pages.  That also prevents
checking the head `page` then incorrectly call page_remove_rmap(subpage)
for a tail page, that isn't even in the shrink_page_list()'s page_list (an
effect of split huge pmd/pmvw), as it might happen today in this unlikely
scenario.)

MADV_FREE'd buffers:
===================

So, back to the "if MADV_FREE pages are used as buffers" note.  The case
is arguable, and subject to multiple interpretations.

The madvise(2) manual page on the MADV_FREE advice value says:

1) 'After a successful MADV_FREE ... data will be lost when
   the kernel frees the pages.'
2) 'the free operation will be canceled if the caller writes
   into the page' / 'subsequent writes ... will succeed and
   then [the] kernel cannot free those dirtied pages'
3) 'If there is no subsequent write, the kernel can free the
   pages at any time.'

Thoughts, questions, considerations... respectively:

1) Since the kernel didn't actually free the page (page_ref_freeze()
   failed), should the data not have been lost? (on userspace read.)
2) Should writes performed by the direct IO read be able to cancel
   the free operation?
   - Should the direct IO read be considered as 'the caller' too,
     as it's been requested by 'the caller'?
   - Should the bio technique to dirty pages on return to userspace
     (bio_check_pages_dirty() is called/used by __blkdev_direct_IO())
     be considered in another/special way here?
3) Should an upcoming write from a previously requested direct IO
   read be considered as a subsequent write, so the kernel should
   not free the pages? (as it's known at the time of page reclaim.)

And lastly:

Technically, the last point seems a reasonable consideration and balance, as
the madvise(2) manual page apparently (and fairly) assumes that 'writes' are
memory accesses from the userspace process (not explicitly considering writes
from the kernel or its corner cases; again, fairly). Plus, the kernel fix for
the corner case of the largely 'non-atomic write' encompassed by a direct IO
read operation is relatively simple, and it helps.

Reproducer:
==========

@ test.c (simplified, but works)

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>
	#include <sys/mman.h>

	int main() {
		int fd, i;
		char *buf;

		fd = open(DEV, O_RDONLY | O_DIRECT);

		buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                	   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		for (i = 0; i < BUF_SIZE; i += PAGE_SIZE)
			buf[i] = 1; // init to non-zero

		madvise(buf, BUF_SIZE, MADV_FREE);

		read(fd, buf, BUF_SIZE);

		for (i = 0; i < BUF_SIZE; i += PAGE_SIZE)
			printf("%p: 0x%x\n", &buf[i], buf[i]);

		return 0;
	}

@ block/fops.c (formerly fs/block_dev.c)

	+#include <linux/swap.h>
	...
	... __blkdev_direct_IO[_simple](...)
	{
	...
	+	if (!strcmp(current->comm, "good"))
	+		shrink_all_memory(ULONG_MAX);
	+
         	ret = bio_iov_iter_get_pages(...);
	+
	+	if (!strcmp(current->comm, "bad"))
	+		shrink_all_memory(ULONG_MAX);
	...
	}

@ shell

        # NUM_PAGES=4
        # PAGE_SIZE=$(getconf PAGE_SIZE)

        # yes | dd of=test.img bs=${PAGE_SIZE} count=${NUM_PAGES}
        # DEV=$(losetup -f --show test.img)

        # gcc -DDEV=\"$DEV\" \
              -DBUF_SIZE=$((PAGE_SIZE * NUM_PAGES)) \
              -DPAGE_SIZE=${PAGE_SIZE} \
               test.c -o test

        # od -tx1 $DEV
        0000000 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a
        *
        0040000

        # mv test good
        # ./good
        0x7f7c10418000: 0x79
        0x7f7c10419000: 0x79
        0x7f7c1041a000: 0x79
        0x7f7c1041b000: 0x79

        # mv good bad
        # ./bad
        0x7fa1b8050000: 0x0
        0x7fa1b8051000: 0x0
        0x7fa1b8052000: 0x0
        0x7fa1b8053000: 0x0

Note: the issue is consistent on v5.17-rc3, but it's intermittent with the
support of MADV_FREE on v4.5 (60%-70% error; needs swap).  [wrap
do_direct_IO() in do_blockdev_direct_IO() @ fs/direct-io.c].

- v5.17-rc3:

        # for i in {1..1000}; do ./good; done \
            | cut -d: -f2 | sort | uniq -c
           4000  0x79

        # mv good bad
        # for i in {1..1000}; do ./bad; done \
            | cut -d: -f2 | sort | uniq -c
           4000  0x0

        # free | grep Swap
        Swap:             0           0           0

- v4.5:

        # for i in {1..1000}; do ./good; done \
            | cut -d: -f2 | sort | uniq -c
           4000  0x79

        # mv good bad
        # for i in {1..1000}; do ./bad; done \
            | cut -d: -f2 | sort | uniq -c
           2702  0x0
           1298  0x79

        # swapoff -av
        swapoff /swap

        # for i in {1..1000}; do ./bad; done \
            | cut -d: -f2 | sort | uniq -c
           4000  0x79

Ceph/TCMalloc:
=============

For documentation purposes, the use case driving the analysis/fix is Ceph
on Ubuntu 18.04, as the TCMalloc library there still uses MADV_FREE to
release unused memory to the system from the mmap'ed page heap (might be
committed back/used again; it's not munmap'ed.) - PageHeap::DecommitSpan()
-> TCMalloc_SystemRelease() -> madvise() - PageHeap::CommitSpan() ->
TCMalloc_SystemCommit() -> do nothing.

Note: TCMalloc switched back to MADV_DONTNEED a few commits after the
release in Ubuntu 18.04 (google-perftools/gperftools 2.5), so the issue
just 'disappeared' on Ceph on later Ubuntu releases but is still present
in the kernel, and can be hit by other use cases.

The observed issue seems to be the old Ceph bug #22464 [1], where checksum
mismatches are observed (and instrumentation with buffer dumps shows
zero-pages read from mmap'ed/MADV_FREE'd page ranges).

The issue in Ceph was reasonably deemed a kernel bug (comment #50) and
mostly worked around with a retry mechanism, but other parts of Ceph could
still hit that (rocksdb).  Anyway, it's less likely to be hit again as
TCMalloc switched out of MADV_FREE by default.

(Some kernel versions/reports from the Ceph bug, and relation with
the MADV_FREE introduction/changes; TCMalloc versions not checked.)
- 4.4 good
- 4.5 (madv_free: introduction)
- 4.9 bad
- 4.10 good? maybe a swapless system
- 4.12 (madv_free: no longer free instantly on swapless systems)
- 4.13 bad

[1] https://tracker.ceph.com/issues/22464

Thanks:
======

Several people contributed to analysis/discussions/tests/reproducers in
the first stages when drilling down on ceph/tcmalloc/linux kernel:

- Dan Hill
- Dan Streetman
- Dongdong Tao
- Gavin Guo
- Gerald Yang
- Heitor Alves de Siqueira
- Ioanna Alifieraki
- Jay Vosburgh
- Matthew Ruffell
- Ponnuvel Palaniyappan

Reviews, suggestions, corrections, comments:

- Minchan Kim
- Yu Zhao
- Huang, Ying
- John Hubbard
- Christoph Hellwig

[[email protected]: v4]
  Link: https://lkml.kernel.org/r/[email protected]
  Link: https://lkml.kernel.org/r/[email protected]

Fixes: 802a3a9 ("mm: reclaim MADV_FREE pages")
Signed-off-by: Mauricio Faria de Oliveira <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Dan Hill <[email protected]>
Cc: Dan Streetman <[email protected]>
Cc: Dongdong Tao <[email protected]>
Cc: Gavin Guo <[email protected]>
Cc: Gerald Yang <[email protected]>
Cc: Heitor Alves de Siqueira <[email protected]>
Cc: Ioanna Alifieraki <[email protected]>
Cc: Jay Vosburgh <[email protected]>
Cc: Matthew Ruffell <[email protected]>
Cc: Ponnuvel Palaniyappan <[email protected]>
Cc: <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
matttbe pushed a commit that referenced this issue Jun 16, 2023
The commit 4f7e723 ("cgroup: Fix threadgroup_rwsem <-> cpus_read_lock()
deadlock") fixed the deadlock between cgroup_threadgroup_rwsem and
cpus_read_lock() by introducing cgroup_attach_{lock,unlock}() and removing
cpus_read_{lock,unlock}() from cpuset_attach(). But cgroup_transfer_tasks()
was missed and not handled, which will cause the following warning:

 WARNING: CPU: 0 PID: 589 at kernel/cpu.c:526 lockdep_assert_cpus_held+0x32/0x40
 CPU: 0 PID: 589 Comm: kworker/1:4 Not tainted 6.4.0-rc2-next-20230517 #50
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
 Workqueue: events cpuset_hotplug_workfn
 RIP: 0010:lockdep_assert_cpus_held+0x32/0x40
 <...>
 Call Trace:
  <TASK>
  cpuset_attach+0x40/0x240
  cgroup_migrate_execute+0x452/0x5e0
  ? _raw_spin_unlock_irq+0x28/0x40
  cgroup_transfer_tasks+0x1f3/0x360
  ? find_held_lock+0x32/0x90
  ? cpuset_hotplug_workfn+0xc81/0xed0
  cpuset_hotplug_workfn+0xcb1/0xed0
  ? process_one_work+0x248/0x5b0
  process_one_work+0x2b9/0x5b0
  worker_thread+0x56/0x3b0
  ? process_one_work+0x5b0/0x5b0
  kthread+0xf1/0x120
  ? kthread_complete_and_exit+0x20/0x20
  ret_from_fork+0x1f/0x30
  </TASK>

So just use the cgroup_attach_{lock,unlock}() helper to fix it.
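
For illustration, the shape of the change inside cgroup_transfer_tasks(), heavily elided and using the helper names from the commit message; whether the threadgroup lock is also needed here (the boolean argument) is an assumption of this sketch:

        /*
         * Take the same lock pair as the regular attach paths so that
         * cpuset_attach() runs with cpus_read_lock() held and
         * lockdep_assert_cpus_held() is satisfied.
         */
        cgroup_attach_lock(true);

        /* ... migrate the tasks (cgroup_migrate() and friends) ... */

        cgroup_attach_unlock(true);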

Reported-by: Zhao Gongyi <[email protected]>
Signed-off-by: Qi Zheng <[email protected]>
Acked-by: Muchun Song <[email protected]>
Fixes: 05c7b7a ("cgroup/cpuset: Fix a race between cpuset_attach() and cpu hotplug")
Cc: [email protected] # v5.17+
Signed-off-by: Tejun Heo <[email protected]>
jenkins-tessares pushed a commit that referenced this issue Sep 22, 2023
Commit a1d7671 ("md: use mddev->external to select holder in
export_rdev()") fixed the problem that 'claim_rdev' is used for
blkdev_get_by_dev() while 'rdev' is used for blkdev_put().

However, if mddev->external is changed from 0 to 1, then 'rdev' is used
for blkdev_get_by_dev() while 'claim_rdev' is used for blkdev_put(). And
this problem can be reproduced reliably by the following:

New file: mdadm/tests/23rdev-lifetime

devname=${dev0##*/}
devt=`cat /sys/block/$devname/dev`
pid=""
runtime=2

clean_up_test() {
        kill -9 $pid
        echo clear > /sys/block/md0/md/array_state
}

trap 'clean_up_test' EXIT

add_by_sysfs() {
        while true; do
                echo $devt > /sys/block/md0/md/new_dev
        done
}

remove_by_sysfs(){
        while true; do
                echo remove > /sys/block/md0/md/dev-${devname}/state
        done
}

echo md0 > /sys/module/md_mod/parameters/new_array || die "create md0 failed"

add_by_sysfs &
pid="$pid $!"

remove_by_sysfs &
pid="$pid $!"

sleep $runtime
exit 0

Test cmd:

./test --save-logs --logdir=/tmp/ --keep-going --dev=loop --tests=23rdev-lifetime

Test result:

------------[ cut here ]------------
WARNING: CPU: 0 PID: 960 at block/bdev.c:618 blkdev_put+0x27c/0x330
Modules linked in: multipath md_mod loop
CPU: 0 PID: 960 Comm: test Not tainted 6.5.0-rc2-00121-g01e55c376936-dirty #50
RIP: 0010:blkdev_put+0x27c/0x330
Call Trace:
 <TASK>
 export_rdev.isra.23+0x50/0xa0 [md_mod]
 mddev_unlock+0x19d/0x300 [md_mod]
 rdev_attr_store+0xec/0x190 [md_mod]
 sysfs_kf_write+0x52/0x70
 kernfs_fop_write_iter+0x19a/0x2a0
 vfs_write+0x3b5/0x770
 ksys_write+0x74/0x150
 __x64_sys_write+0x22/0x30
 do_syscall_64+0x40/0x90
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

Fix the problem by recording whether 'rdev' is used as the holder.
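
For illustration only, a heavily hedged fragment of one way to realize that idea: remember which pointer was passed as holder at open time and reuse it at put time. choose_holder() and rdev->holder are hypothetical names, and the blkdev call signatures assumed here are the v6.5-era ones this commit targets:

        /* Hypothetical sketch, not the actual md fix. */
        void *holder = choose_holder(mddev, rdev);      /* hypothetical helper */

        rdev->bdev = blkdev_get_by_dev(dev, mode, holder, NULL);
        if (IS_ERR(rdev->bdev))
                return PTR_ERR(rdev->bdev);
        rdev->holder = holder;                          /* hypothetical field */

        /* ... later, when tearing the rdev down ... */
        blkdev_put(rdev->bdev, rdev->holder);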

Fixes: a1d7671 ("md: use mddev->external to select holder in export_rdev()")
Signed-off-by: Yu Kuai <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
cpaasch pushed a commit that referenced this issue Jan 24, 2024
This is a partial backport of the upstream commit 39880bd ("mptcp:
get rid of msk->subflow"). It's partial to avoid a long and complex
dependency chain not suitable for stable.

The only bit remaining from the original commit is the introduction of a
new field to avoid a race at close time causing a UaF:

BUG: KASAN: use-after-free in mptcp_subflow_queue_clean+0x2c9/0x390 include/net/mptcp.h:104
Read of size 1 at addr ffff88803bf72884 by task syz-executor.6/23092

CPU: 0 PID: 23092 Comm: syz-executor.6 Not tainted 6.1.65-gc6114c845984 #50
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0x125/0x18f lib/dump_stack.c:106
 print_report+0x163/0x4f0 mm/kasan/report.c:284
 kasan_report+0xc4/0x100 mm/kasan/report.c:495
 mptcp_subflow_queue_clean+0x2c9/0x390 include/net/mptcp.h:104
 mptcp_check_listen_stop+0x190/0x2a0 net/mptcp/protocol.c:3009
 __mptcp_close+0x9a/0x970 net/mptcp/protocol.c:3024
 mptcp_close+0x2a/0x130 net/mptcp/protocol.c:3089
 inet_release+0x13d/0x190 net/ipv4/af_inet.c:429
 sock_close+0xcf/0x230 net/socket.c:652
 __fput+0x3a2/0x870 fs/file_table.c:320
 task_work_run+0x24e/0x300 kernel/task_work.c:179
 resume_user_mode_work include/linux/resume_user_mode.h:49 [inline]
 exit_to_user_mode_loop+0xa4/0xc0 kernel/entry/common.c:171
 exit_to_user_mode_prepare+0x51/0x90 kernel/entry/common.c:204
 syscall_exit_to_user_mode+0x26/0x140 kernel/entry/common.c:286
 do_syscall_64+0x53/0xa0 arch/x86/entry/common.c:86
 entry_SYSCALL_64_after_hwframe+0x64/0xce
RIP: 0033:0x41d791
Code: 75 14 b8 03 00 00 00 0f 05 48 3d 01 f0 ff ff 0f 83 74 2a 00 00 c3 48 83 ec 08 e8 9a fc ff ff 48 89 04 24 b8 03 00 00 00 0f 05 <48> 8b 3c 24 48 89 c2 e8 e3 fc ff ff 48 89 d0 48 83 c4 08 48 3d 01
RSP: 002b:00000000008bfb90 EFLAGS: 00000293 ORIG_RAX: 0000000000000003
RAX: 0000000000000000 RBX: 0000000000000004 RCX: 000000000041d791
RDX: 0000001b33920000 RSI: ffffffff8139adff RDI: 0000000000000003
RBP: 000000000079d980 R08: 0000001b33d20000 R09: 0000000000000951
R10: 000000008139a955 R11: 0000000000000293 R12: 00000000000c739b
R13: 000000000079bf8c R14: 00007fa301053000 R15: 00000000000c705a
 </TASK>

Allocated by task 22528:
 kasan_save_stack mm/kasan/common.c:45 [inline]
 kasan_set_track+0x40/0x70 mm/kasan/common.c:52
 ____kasan_kmalloc mm/kasan/common.c:374 [inline]
 __kasan_kmalloc+0xa0/0xb0 mm/kasan/common.c:383
 kasan_kmalloc include/linux/kasan.h:211 [inline]
 __do_kmalloc_node mm/slab_common.c:955 [inline]
 __kmalloc+0xaa/0x1c0 mm/slab_common.c:968
 kmalloc include/linux/slab.h:558 [inline]
 sk_prot_alloc+0xac/0x200 net/core/sock.c:2038
 sk_clone_lock+0x56/0x1090 net/core/sock.c:2236
 inet_csk_clone_lock+0x26/0x420 net/ipv4/inet_connection_sock.c:1141
 tcp_create_openreq_child+0x30/0x1910 net/ipv4/tcp_minisocks.c:474
 tcp_v6_syn_recv_sock+0x413/0x1a90 net/ipv6/tcp_ipv6.c:1283
 subflow_syn_recv_sock+0x621/0x1300 net/mptcp/subflow.c:730
 tcp_get_cookie_sock+0xf0/0x5f0 net/ipv4/syncookies.c:201
 cookie_v6_check+0x130f/0x1c50 net/ipv6/syncookies.c:261
 tcp_v6_do_rcv+0x7e0/0x12b0 net/ipv6/tcp_ipv6.c:1147
 tcp_v6_rcv+0x2494/0x2f50 net/ipv6/tcp_ipv6.c:1743
 ip6_protocol_deliver_rcu+0xba3/0x1620 net/ipv6/ip6_input.c:438
 ip6_input+0x1bc/0x470 net/ipv6/ip6_input.c:483
 ipv6_rcv+0xef/0x2c0 include/linux/netfilter.h:302
 __netif_receive_skb+0x1ea/0x6a0 net/core/dev.c:5525
 process_backlog+0x353/0x660 net/core/dev.c:5967
 __napi_poll+0xc6/0x5a0 net/core/dev.c:6534
 net_rx_action+0x652/0xea0 net/core/dev.c:6601
 __do_softirq+0x176/0x525 kernel/softirq.c:571

Freed by task 23093:
 kasan_save_stack mm/kasan/common.c:45 [inline]
 kasan_set_track+0x40/0x70 mm/kasan/common.c:52
 kasan_save_free_info+0x2b/0x50 mm/kasan/generic.c:516
 ____kasan_slab_free+0x13a/0x1b0 mm/kasan/common.c:236
 kasan_slab_free include/linux/kasan.h:177 [inline]
 slab_free_hook mm/slub.c:1724 [inline]
 slab_free_freelist_hook mm/slub.c:1750 [inline]
 slab_free mm/slub.c:3661 [inline]
 __kmem_cache_free+0x1eb/0x340 mm/slub.c:3674
 sk_prot_free net/core/sock.c:2074 [inline]
 __sk_destruct+0x4ad/0x620 net/core/sock.c:2160
 tcp_v6_rcv+0x269c/0x2f50 net/ipv6/tcp_ipv6.c:1761
 ip6_protocol_deliver_rcu+0xba3/0x1620 net/ipv6/ip6_input.c:438
 ip6_input+0x1bc/0x470 net/ipv6/ip6_input.c:483
 ipv6_rcv+0xef/0x2c0 include/linux/netfilter.h:302
 __netif_receive_skb+0x1ea/0x6a0 net/core/dev.c:5525
 process_backlog+0x353/0x660 net/core/dev.c:5967
 __napi_poll+0xc6/0x5a0 net/core/dev.c:6534
 net_rx_action+0x652/0xea0 net/core/dev.c:6601
 __do_softirq+0x176/0x525 kernel/softirq.c:571

The buggy address belongs to the object at ffff88803bf72000
 which belongs to the cache kmalloc-4k of size 4096
The buggy address is located 2180 bytes inside of
 4096-byte region [ffff88803bf72000, ffff88803bf73000)

The buggy address belongs to the physical page:
page:00000000a72e4e51 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x3bf70
head:00000000a72e4e51 order:3 compound_mapcount:0 compound_pincount:0
flags: 0x100000000010200(slab|head|node=0|zone=1)
raw: 0100000000010200 ffffea0000a0ea00 dead000000000002 ffff888100042140
raw: 0000000000000000 0000000000040004 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff88803bf72780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff88803bf72800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff88803bf72880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                   ^
 ffff88803bf72900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff88803bf72980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

Prevent the MPTCP worker from freeing the first subflow for unaccepted
sockets when such sockets transition to TCP_CLOSE state, and let that
happen at accept() or listener close() time.

Fixes: b6985b9 ("mptcp: use the workqueue to destroy unaccepted sockets")
Signed-off-by: Paolo Abeni <[email protected]>