Update from Source Repository #474

CerebralMischief · 2017-10-01T06:57:09Z

No description provided.

Update from source fork.

Update from source repo

Update from source Repo

Update from Source Repository

Update from source repo

Update from Source Repository

Update from Source Repo

Update from source repository

Update From Source Repository

Update from source repo

Update from Source Repository

Update from source repository

Update from source Repository

Update from source repo

Update from Source Repository

CerebralMischief · 2017-10-01T06:57:45Z

Update from Source Repository

CerebralMischief · 2017-10-01T06:58:33Z

Whoops. Wrong direction. :/

When an interface's namespace is changed, dev_change_net_namespace() is called. This removes and re-allocates many resources, including sysfs files. "/sys/class/net/<interface name>" is one of them. If the sysfs creation routine (device_rename()) finds a duplicate sysfs file name, it warns about it and fails. Unfortunately, at that point dev_change_net_namespace() does not return failure because the rollback cost is too high, so the interface is left without a sysfs file.

The approach of this patch is to find the duplicate sysfs file as early as possible. If one is found, dev_change_net_namespace() fails immediately with zero rollback cost.

This patch includes two other changes:
a) Acquire rtnl_lock() in both bond_create_sysfs() and bond_destroy_sysfs() to avoid a race condition.
b) Do not remove the "/sys/class/net/bonding_masters" sysfs file in bond_destroy_sysfs() if the file wasn't created by bond_create_sysfs().

Test commands:
    ip netns add nst
    ip link add bonding_masters type dummy
    modprobe bonding
    ip link set bonding_masters netns nst

Splat looks like:
[ 32.793965][ T986] WARNING: CPU: 3 PID: 986 at net/core/dev.c:10098 dev_change_net_namespace+0x9be/0xc10
[ 32.795213][ T986] Modules linked in: bonding dummy openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 x
[ 32.797369][ T986] CPU: 3 PID: 986 Comm: ip Not tainted 5.6.0-rc5+ #474
[ 32.798137][ T986] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 32.799111][ T986] RIP: 0010:dev_change_net_namespace+0x9be/0xc10
[ 32.799838][ T986] Code: 45 34 b2 c6 05 85 a4 87 01 01 e8 0d aa c7 fe 0f 0b e9 dd f6 ff ff b8 ea ff ff ff e9 82 fb ff
[ 32.805599][ T986] RSP: 0018:ffff88804aeeee60 EFLAGS: 00010282
[ 32.806247][ T986] RAX: 00000000ffffffef RBX: ffff888057151000 RCX: 0000000000000006
[ 32.807110][ T986] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff88804ac2c014
[ 32.807997][ T986] RBP: ffff8880571510b8 R08: fffffbfff67b65cc R09: fffffbfff67b65cc
[ 32.808873][ T986] R10: 0000000000000001 R11: fffffbfff67b65cb R12: ffff8880571510a0
[ 32.809720][ T986] R13: ffff88804b9f0040 R14: ffff888057151090 R15: ffff888057151c08
[ 32.810575][ T986] FS: 00007f0c9d5960c0(0000) GS:ffff88806cc00000(0000) knlGS:0000000000000000
[ 32.811540][ T986] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 32.812314][ T986] CR2: 00007fcaf6747590 CR3: 0000000049c58005 CR4: 00000000000606e0
[ 32.813191][ T986] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 32.814052][ T986] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 32.822906][ T986] Call Trace:
[ 32.823294][ T986] ? do_dup2+0x450/0x450
[ 32.823828][ T986] ? dev_get_valid_name+0xc0/0xc0
[ 32.824421][ T986] ? ns_capable_common+0x5c/0xd0
[ 32.825007][ T986] ? __netlink_ns_capable+0xc3/0xf0
[ 32.825650][ T986] do_setlink+0x163/0x2ef0
[ 32.826088][ T986] ? is_bpf_image_address+0xff/0x1d0
[ 32.826663][ T986] ? rtnl_getlink+0x8a0/0x8a0
[ 32.827275][ T986] ? __kernel_text_address+0xe/0x30
[ 32.827999][ T986] ? unwind_get_return_address+0x5f/0xa0
[ 32.828793][ T986] ? create_prof_cpu_mask+0x20/0x20
[ 32.829391][ T986] ? arch_stack_walk+0x83/0xb0
[ 32.829949][ T986] ? memset+0x1f/0x40
[ 32.830410][ T986] ? __nla_validate_parse+0x98/0x1ab0
[ 32.831046][ T986] ? nla_memcpy+0x90/0x90
[ 32.831544][ T986] ? __lock_acquire+0xdfe/0x3de0
[ 32.832136][ T986] __rtnl_newlink+0x9c5/0x1270
[ ... ]

Reported-by: [email protected]
Fixes: b76cdba ("[PATCH] bonding: add sysfs functionality to bonding (large)")
Signed-off-by: Taehee Yoo <[email protected]>
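The "check for the duplicate before touching anything" idea in the commit message above can be illustrated outside the kernel. The following is a minimal user-space sketch, not the kernel code: `struct name_table`, `move_name()` and the `-1` return value are hypothetical stand-ins for the per-namespace sysfs entries, dev_change_net_namespace() and its error code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_NAMES 8

/* Hypothetical stand-in for the set of entry names in one namespace. */
struct name_table {
	const char *names[MAX_NAMES];
	size_t count;
};

static bool table_has_name(const struct name_table *t, const char *name)
{
	for (size_t i = 0; i < t->count; i++)
		if (strcmp(t->names[i], name) == 0)
			return true;
	return false;
}

/* Remove `name` from `t`; returns true if it was present. */
static bool table_remove(struct name_table *t, const char *name)
{
	for (size_t i = 0; i < t->count; i++) {
		if (strcmp(t->names[i], name) == 0) {
			/* Fill the hole with the last entry. */
			t->names[i] = t->names[--t->count];
			return true;
		}
	}
	return false;
}

/*
 * Move `name` from `src` to `dst`. Every check that can fail runs before
 * either table is modified, so a failure needs no rollback.
 * Returns 0 on success, -1 if the move is refused.
 */
static int move_name(struct name_table *src, struct name_table *dst,
		     const char *name)
{
	assert(src && dst && name);

	if (!table_has_name(src, name))
		return -1;
	if (table_has_name(dst, name) || dst->count == MAX_NAMES)
		return -1;	/* duplicate or full: nothing was touched */

	table_remove(src, name);
	dst->names[dst->count++] = name;
	return 0;
}
```

With a destination that already holds "bonding_masters", `move_name()` refuses the move and both tables are left exactly as they were, which is the property the patch is after.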

Suppress gcc warnings


checkpatch warns about comparisons to NULL, e.g.

    CHECK: Comparison to NULL could be written "!rt"
    #474: FILE: net/l2tp/l2tp_ip.c:474:
    +       if (rt == NULL) {

This sort of comparison is generally clearer and more readable the way checkpatch suggests, so update l2tp accordingly.

Signed-off-by: Tom Parkin <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
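As a small illustration of the style change that checkpatch suggests, the two functions below behave identically. This is only a sketch: `struct route` and both function names are made up for the example and are not taken from net/l2tp.

```c
#include <stddef.h>

/* Hypothetical structure, used only to have something to point at. */
struct route {
	int metric;
};

/* Before: explicit comparison, which checkpatch reports as a CHECK. */
static int route_metric_before(const struct route *rt)
{
	if (rt == NULL)
		return -1;
	return rt->metric;
}

/* After: the form checkpatch proposes; same behaviour, shorter test. */
static int route_metric_after(const struct route *rt)
{
	if (!rt)
		return -1;
	return rt->metric;
}
```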


…b cache

commit 28b34f0 upstream.

Some socket buffers allocated in the fclone cache (in __alloc_skb) can end up in the following path[1]:

  napi_skb_finish
    __kfree_skb_defer
      napi_skb_cache_put

The issue is that napi_skb_cache_put is not fclone friendly and will put those skbuff in the skb cache to be reused later, although this cache only expects skbuff allocated from skbuff_head_cache. When this happens the skbuff is eventually freed using the wrong origin cache, and we can see traces similar to:

[ 1223.947534] cache_from_obj: Wrong slab cache. skbuff_head_cache but object is from skbuff_fclone_cache
[ 1223.948895] WARNING: CPU: 3 PID: 0 at mm/slab.h:442 kmem_cache_free+0x251/0x3e0
[ 1223.950211] Modules linked in:
[ 1223.950680] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.13.0+ #474
[ 1223.951587] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-3.fc34 04/01/2014
[ 1223.953060] RIP: 0010:kmem_cache_free+0x251/0x3e0

This sometimes leads to other memory related issues.

Fix this by using __kfree_skb for fclone skbuff, similar to what is done in the other place __kfree_skb_defer is called.

[1] At least in setups using veth pairs and tunnels. Building a kernel with KASAN we can for example see packets allocated in sk_stream_alloc_skb hit the above path and later the issue arises when the skbuff is reused.

Fixes: 9243adf ("skbuff: queue NAPI_MERGED_FREE skbs into NAPI cache instead of freeing")
Cc: Alexander Lobakin <[email protected]>
Signed-off-by: Antoine Tenart <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
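The rule behind this fix, that a buffer may only be parked in a reuse cache if it came from the allocator that cache expects, can be modelled in a few lines of user-space C. This is a simplified sketch under stated assumptions, not the skbuff code: `struct buf`, `struct reuse_cache`, `deferred_free()` and the flags are all hypothetical, and "freeing to the origin" is reduced to setting a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Which allocator produced a buffer (hypothetical model of two slab caches). */
enum origin { ORIGIN_HEAD_CACHE, ORIGIN_FCLONE_CACHE };

struct buf {
	enum origin origin;
	bool in_reuse_cache;	/* parked for later reuse */
	bool freed_directly;	/* returned straight to its own allocator */
};

#define REUSE_SLOTS 4

/* Reuse cache that must only ever hold ORIGIN_HEAD_CACHE buffers. */
struct reuse_cache {
	struct buf *slots[REUSE_SLOTS];
	size_t count;
};

/* Return the buffer to the allocator it came from (modelled as a flag). */
static void free_to_origin(struct buf *b)
{
	b->freed_directly = true;
}

/*
 * Deferred free: park the buffer for reuse only when its origin matches
 * what the reuse cache expects; otherwise free it the ordinary way.
 */
static void deferred_free(struct reuse_cache *c, struct buf *b)
{
	assert(c && b);

	if (b->origin != ORIGIN_HEAD_CACHE || c->count == REUSE_SLOTS) {
		free_to_origin(b);
		return;
	}
	b->in_reuse_cache = true;
	c->slots[c->count++] = b;
}
```

Without the origin check, the fclone buffer would land in `slots[]` and later be handed back to the wrong allocator, which is the situation the "Wrong slab cache" warning reports.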

Fix soundness issue with `container_of!` macro


I got the following double free report:

  BUG: KASAN: double-free or invalid-free in kfree+0xce/0x390

  CPU: 0 PID: 359 Comm: xrun Tainted: G W 5.15.0-rc3-00109-g4dfd49fafc4d-dirty #474 523b7f3c65c42247635e2ac04a95f61f9f36678d
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
  Call Trace:
   dump_stack_lvl+0xe2/0x152
   print_address_description.constprop.7+0x21/0x150
   kasan_report_invalid_free+0x6f/0xa0
   __kasan_slab_free+0x125/0x140
   slab_free_freelist_hook+0x10d/0x240
   kfree+0xce/0x390
   iio_device_unregister_sysfs+0x108/0x13b [industrialio]
   iio_dev_release+0x9e/0x10e [industrialio]
   device_release+0xa5/0x240
   kobject_put+0x1e5/0x540
   put_device+0x20/0x30
   devm_iio_device_release+0x21/0x30 [industrialio]
   release_nodes+0xc3/0x3b0
   devres_release_group+0x1da/0x2c0
   i2c_device_probe+0x628/0xbb0
   really_probe+0x285/0xc30

If __iio_device_register() fails, iio_dev_opaque->groups is freed in the error path by iio_device_unregister_sysfs(). iio_dev_release() then calls iio_device_unregister_sysfs() again, which causes a double free.

Set iio_dev_opaque->groups to NULL when it is freed to fix this double free.

Fixes: 32f1717 ("iio: core: rework iio device group creation")
Reported-by: Hulk Robot <[email protected]>
Reviewed-by: Alexandru Ardelean <[email protected]>
Signed-off-by: Yang Yingliang <[email protected]>
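The fix described above relies on a common pattern: clear the pointer right after freeing it so that a second cleanup call becomes harmless. The sketch below shows the pattern in user space with free(), which like kfree() accepts NULL. `struct dev_state` and the two functions are hypothetical and only mimic the shape of the iio code.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical device state holding one dynamically allocated array. */
struct dev_state {
	int **groups;
};

/* Allocate room for `n` group pointers; returns 0 on success, -1 on failure. */
static int dev_alloc_groups(struct dev_state *d, size_t n)
{
	d->groups = calloc(n, sizeof(*d->groups));
	return d->groups ? 0 : -1;
}

/*
 * Cleanup that is safe to call more than once: the pointer is reset to
 * NULL after the free, and free(NULL) is defined to do nothing.
 */
static void dev_unregister_groups(struct dev_state *d)
{
	assert(d);
	free(d->groups);
	d->groups = NULL;
}
```

Calling `dev_unregister_groups()` from both an error path and a release callback is then fine, because the second call only sees NULL.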


The purpose of btrfs_bbio_propagate_error() is to propagate an error of a split bio to its original btrfs_bio and to report the error to the upper layer. However, it does not work well in some cases.

* Case 1. Immediate (or quick) end_bio with an error

When btrfs sends a btrfs_bio to mirrored devices, btrfs calls btrfs_bio_end_io() when all the mirroring bios are completed. If that btrfs_bio was split, it is from btrfs_clone_bioset and its end_io function is btrfs_orig_write_end_io. For this case, btrfs_bbio_propagate_error() accesses the orig_bbio's bio context to increase the error count.

That works well in most cases. However, if the end_io is called fast enough, the orig_bbio's (remaining part after split) bio context may not be properly set at that time. Since the bio context is set when the orig_bbio (the last btrfs_bio) is sent to devices, that might be too late for an earlier split btrfs_bio's completion. That will result in a NULL pointer dereference.

That bug is easily reproducible by running btrfs/146 on zoned devices [1] and it shows the following trace.

[1] You need the raid-stripe-tree feature as it creates a "-d raid0 -m raid1" FS.

  BUG: kernel NULL pointer dereference, address: 0000000000000020
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 0 P4D 0
  Oops: Oops: 0000 [#1] PREEMPT SMP PTI
  CPU: 1 UID: 0 PID: 13 Comm: kworker/u32:1 Not tainted 6.11.0-rc7-BTRFS-ZNS+ #474
  Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  Workqueue: writeback wb_workfn (flush-btrfs-5)
  RIP: 0010:btrfs_bio_end_io+0xae/0xc0 [btrfs]
  BTRFS error (device dm-0): bdev /dev/mapper/error-test errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
  RSP: 0018:ffffc9000006f248 EFLAGS: 00010246
  RAX: 0000000000000000 RBX: ffff888005a7f080 RCX: ffffc9000006f1dc
  RDX: 0000000000000000 RSI: 000000000000000a RDI: ffff888005a7f080
  RBP: ffff888011dfc540 R08: 0000000000000000 R09: 0000000000000001
  R10: ffffffff82e508e0 R11: 0000000000000005 R12: ffff88800ddfbe58
  R13: ffff888005a7f080 R14: ffff888005a7f158 R15: ffff888005a7f158
  FS: 0000000000000000(0000) GS:ffff88803ea80000(0000) knlGS:0000000000000000
  CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000020 CR3: 0000000002e22006 CR4: 0000000000370ef0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   <TASK>
   ? __die_body.cold+0x19/0x26
   ? page_fault_oops+0x13e/0x2b0
   ? _printk+0x58/0x73
   ? do_user_addr_fault+0x5f/0x750
   ? exc_page_fault+0x76/0x240
   ? asm_exc_page_fault+0x22/0x30
   ? btrfs_bio_end_io+0xae/0xc0 [btrfs]
   ? btrfs_log_dev_io_error+0x7f/0x90 [btrfs]
   btrfs_orig_write_end_io+0x51/0x90 [btrfs]
   dm_submit_bio+0x5c2/0xa50 [dm_mod]
   ? find_held_lock+0x2b/0x80
   ? blk_try_enter_queue+0x90/0x1e0
   __submit_bio+0xe0/0x130
   ? ktime_get+0x10a/0x160
   ? lockdep_hardirqs_on+0x74/0x100
   submit_bio_noacct_nocheck+0x199/0x410
   btrfs_submit_bio+0x7d/0x150 [btrfs]
   btrfs_submit_chunk+0x1a1/0x6d0 [btrfs]
   ? lockdep_hardirqs_on+0x74/0x100
   ? __folio_start_writeback+0x10/0x2c0
   btrfs_submit_bbio+0x1c/0x40 [btrfs]
   submit_one_bio+0x44/0x60 [btrfs]
   submit_extent_folio+0x13f/0x330 [btrfs]
   ? btrfs_set_range_writeback+0xa3/0xd0 [btrfs]
   extent_writepage_io+0x18b/0x360 [btrfs]
   extent_write_locked_range+0x17c/0x340 [btrfs]
   ? __pfx_end_bbio_data_write+0x10/0x10 [btrfs]
   run_delalloc_cow+0x71/0xd0 [btrfs]
   btrfs_run_delalloc_range+0x176/0x500 [btrfs]
   ? find_lock_delalloc_range+0x119/0x260 [btrfs]
   writepage_delalloc+0x2ab/0x480 [btrfs]
   extent_write_cache_pages+0x236/0x7d0 [btrfs]
   btrfs_writepages+0x72/0x130 [btrfs]
   do_writepages+0xd4/0x240
   ? find_held_lock+0x2b/0x80
   ? wbc_attach_and_unlock_inode+0x12c/0x290
   ? wbc_attach_and_unlock_inode+0x12c/0x290
   __writeback_single_inode+0x5c/0x4c0
   ? do_raw_spin_unlock+0x49/0xb0
   writeback_sb_inodes+0x22c/0x560
   __writeback_inodes_wb+0x4c/0xe0
   wb_writeback+0x1d6/0x3f0
   wb_workfn+0x334/0x520
   process_one_work+0x1ee/0x570
   ? lock_is_held_type+0xc6/0x130
   worker_thread+0x1d1/0x3b0
   ? __pfx_worker_thread+0x10/0x10
   kthread+0xee/0x120
   ? __pfx_kthread+0x10/0x10
   ret_from_fork+0x30/0x50
   ? __pfx_kthread+0x10/0x10
   ret_from_fork_asm+0x1a/0x30
   </TASK>
  Modules linked in: dm_mod btrfs blake2b_generic xor raid6_pq rapl
  CR2: 0000000000000020

* Case 2. Earlier completion of orig_bbio for mirrored btrfs_bios

btrfs_bbio_propagate_error() assumes the end_io function for orig_bbio is called last among split bios. In that case, btrfs_orig_write_end_io() sets the bio->bi_status to BLK_STS_IOERR by seeing the bioc->error [2]. Otherwise, the increased orig_bio's bioc->error is not checked by anyone and BLK_STS_OK is returned to the upper layer.

[2] Actually, this is not true. Because we only increase orig_bioc->errors by max_errors, the condition "atomic_read(&bioc->error) > bioc->max_errors" is still not met if only one split btrfs_bio fails.

* Case 3. Later completion of orig_bbio for un-mirrored btrfs_bios

In contrast to the above case, btrfs_bbio_propagate_error() does not work well if an un-mirrored orig_bbio is completed last. It sets orig_bbio->bio.bi_status to the btrfs_bio's error. However, that is easily overwritten by orig_bbio's completion status. If the status is BLK_STS_OK, the upper layer would not know about the failure.

* Solution

Considering the above cases, we can only save the error status in the orig_bbio (remaining part after split) itself as it is always available. Also, the saved error status should be propagated when all the split btrfs_bios are finished (i.e., bbio->pending_ios == 0).

This commit introduces "status" to btrfs_bbio and saves the first error of split bios to the original btrfs_bio's "status" variable. When all the split bios are finished, the saved status is loaded into the original btrfs_bio's status.

With this commit, btrfs/146 on zoned devices does not hit the NULL pointer dereference anymore.

Fixes: 852eee6 ("btrfs: allow btrfs_submit_bio to split bios")
CC: [email protected] # 6.6+
Reviewed-by: Qu Wenruo <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Naohiro Aota <[email protected]>
Signed-off-by: David Sterba <[email protected]>
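The solution described in the commit message (keep the first error in the original request itself, and publish it only when the pending count reaches zero) can be sketched with C11 atomics. This is an illustrative user-space model under stated assumptions, not the btrfs implementation: `struct parent_io`, `parent_init()` and `part_done()` are hypothetical names, and plain integers stand in for blk_status_t.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the original (parent) I/O request. */
struct parent_io {
	atomic_int pending;	/* parts still outstanding, parent included */
	atomic_int status;	/* 0 = OK; the first non-zero error wins */
	int final_status;	/* written once, by whichever part finishes last */
	int completed;		/* set when final_status has been published */
};

static void parent_init(struct parent_io *p, int parts)
{
	atomic_init(&p->pending, parts);
	atomic_init(&p->status, 0);
	p->final_status = 0;
	p->completed = 0;
}

/* Called once per finished part; the completion order does not matter. */
static void part_done(struct parent_io *p, int err)
{
	assert(p);

	if (err) {
		int expected = 0;

		/* Store the error only if no earlier error was recorded. */
		atomic_compare_exchange_strong(&p->status, &expected, err);
	}

	/* fetch_sub returns the old value: 1 means this was the last part. */
	if (atomic_fetch_sub(&p->pending, 1) == 1) {
		p->final_status = atomic_load(&p->status);
		p->completed = 1;
	}
}
```

Because the error lives in the parent itself and is read only by the last finisher, the result is the same whether the failing part completes first, last, or in the middle, which covers the three cases listed in the commit message.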

CerebralMischief added 17 commits October 6, 2016 10:30

Merge pull request #1 from torvalds/master

b03474c

Update from source fork.

Merge pull request #2 from torvalds/master

bc809a7

Update from source repo

Merge pull request #3 from torvalds/master

c500a34

Update from source repo

Merge pull request #4 from torvalds/master

5264bf7

Update from source repo

Merge pull request #5 from torvalds/master

947a213

Update from source Repo

Merge pull request #6 from torvalds/master

3c1c35b

Update from Source Repository

Merge pull request #7 from torvalds/master

7189073

Update from source repo

Merge pull request #8 from torvalds/master

0e92f9f

Update from Source Repository

Merge pull request #9 from torvalds/master

e1078af

Update from Source Repo

Merge pull request #10 from torvalds/master

2aecd31

Update from source repository

Merge pull request #11 from torvalds/master

d81449e

Update From Source Repository

Merge pull request #12 from torvalds/master

328ee07

Update from source repo

Merge pull request #13 from torvalds/master

1040ece

Update from Source Repository

Merge pull request #14 from torvalds/master

0e0235f

Update from source repository

Merge pull request #15 from torvalds/master

7b6189d

Update from source Repository

Merge pull request #16 from torvalds/master

e6d52fd

Update from source repo

Merge pull request #17 from torvalds/master

6398f5c

Update from Source Repository

CerebralMischief closed this Oct 1, 2017

CerebralMischief reopened this Oct 1, 2017

CerebralMischief closed this Oct 1, 2017

upa pushed a commit to upa/linux that referenced this pull request Jun 23, 2020

Merge pull request torvalds#474 from retrage/v5.3-fixes

06ca3dd

Suppress gcc warnings

sodar pushed a commit to sodar/linux that referenced this pull request Aug 5, 2021

Merge pull request torvalds#474 from LeSeulArtichaut/sound-container-of

24566f3

Fix soundness issue with `container_of!` macro
