
ARM: tegra: Add device-tree for ASUS Transformer Prime TF201 #4

Closed
wants to merge 2 commits

Conversation

@clamor-s (Contributor) commented Jun 2, 2020

No description provided.

@clamor-s clamor-s closed this Jun 2, 2020
@clamor-s clamor-s deleted the clamor95-patch-1 branch June 2, 2020 08:00
@clamor-s (Contributor, Author) commented Jun 2, 2020

Wrong stuff

digetx pushed a commit that referenced this pull request Jun 8, 2020
struct list_lru_one's l.nr_items could be accessed concurrently, as
noticed by KCSAN:

 BUG: KCSAN: data-race in list_lru_count_one / list_lru_isolate_move

 write to 0xffffa102789c4510 of 8 bytes by task 823 on cpu 39:
  list_lru_isolate_move+0xf9/0x130
  list_lru_isolate_move at mm/list_lru.c:180
  inode_lru_isolate+0x12b/0x2a0
  __list_lru_walk_one+0x122/0x3d0
  list_lru_walk_one+0x75/0xa0
  prune_icache_sb+0x8b/0xc0
  super_cache_scan+0x1b8/0x250
  do_shrink_slab+0x256/0x6d0
  shrink_slab+0x41b/0x4a0
  shrink_node+0x35c/0xd80
  balance_pgdat+0x652/0xd90
  kswapd+0x396/0x8d0
  kthread+0x1e0/0x200
  ret_from_fork+0x27/0x50

 read to 0xffffa102789c4510 of 8 bytes by task 6345 on cpu 56:
  list_lru_count_one+0x116/0x2f0
  list_lru_count_one at mm/list_lru.c:193
  super_cache_count+0xe8/0x170
  do_shrink_slab+0x95/0x6d0
  shrink_slab+0x41b/0x4a0
  shrink_node+0x35c/0xd80
  do_try_to_free_pages+0x1f7/0xa10
  try_to_free_pages+0x26c/0x5e0
  __alloc_pages_slowpath+0x458/0x1290
  __alloc_pages_nodemask+0x3bb/0x450
  alloc_pages_vma+0x8a/0x2c0
  do_anonymous_page+0x170/0x700
  __handle_mm_fault+0xc9f/0xd00
  handle_mm_fault+0xfc/0x2f0
  do_page_fault+0x263/0x6f9
  page_fault+0x34/0x40

 Reported by Kernel Concurrency Sanitizer on:
 CPU: 56 PID: 6345 Comm: oom01 Tainted: G        W    L 5.5.0-next-20200205+ #4
 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

A torn ("shattered") l.nr_items value could affect the shrinker behaviour
due to the data race. Fix it by adding READ_ONCE() for the read. Since the
writes are aligned and at most word-sized, assume they are safe from data
races, to avoid the readability cost of writing WRITE_ONCE(var, var + val).

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Qian Cai <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Stephen Rothwell <[email protected]>
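
For readers unfamiliar with the annotation, here is a minimal, self-contained
userspace sketch of the pattern the fix relies on; the READ_ONCE() macro and
the struct are simplified stand-ins for the kernel's definitions, not the
actual diff:

  #include <assert.h>

  /* Simplified stand-in for the kernel's READ_ONCE(): a volatile access
   * that marks this lockless read as intentional, both for the compiler
   * (no torn or refetched loads) and for checkers such as KCSAN. */
  #define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

  struct list_lru_one_sketch {
          long nr_items;  /* updated under lock, counted locklessly */
  };

  /* Lockless counter read, mirroring what list_lru_count_one() does after
   * the fix; the locked writers keep plain aligned, word-sized stores. */
  static long count_one(const struct list_lru_one_sketch *l)
  {
          return READ_ONCE(l->nr_items);
  }

  int main(void)
  {
          struct list_lru_one_sketch lru = { .nr_items = 3 };
          assert(count_one(&lru) == 3);
          return 0;
  }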
digetx pushed a commit that referenced this pull request Jun 9, 2020 (same list_lru data-race commit message as above)
digetx pushed a commit that referenced this pull request Jun 10, 2020 (same list_lru data-race commit message as above)
digetx pushed a commit that referenced this pull request Jun 13, 2020
The code in decode_config4() of arch/mips/kernel/cpu-probe.c

        asid_mask = MIPS_ENTRYHI_ASID;
        if (config4 & MIPS_CONF4_AE)
                asid_mask |= MIPS_ENTRYHI_ASIDX;
        set_cpu_asid_mask(c, asid_mask);

stores asid_mask in cpuinfo->asid_mask.

So in order to support variable ASID_MASK, KVM_ENTRYHI_ASID should also
be changed to cpu_asid_mask(&boot_cpu_data).

Cc: Stable <[email protected]>  #4.9+
Reviewed-by: Aleksandar Markovic <[email protected]>
Signed-off-by: Xing Li <[email protected]>
[Huacai: Change current_cpu_data to boot_cpu_data for optimization]
Signed-off-by: Huacai Chen <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
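
A hedged sketch of what the change described above amounts to; the exact
upstream diff may differ:

  /* Before: KVM assumed the fixed, narrow ASID field. */
  #define KVM_ENTRYHI_ASID        MIPS_ENTRYHI_ASID

  /* After: follow the variable ASID mask probed for the boot CPU, so the
   * extended ASIDX bits are included whenever Config4.AE is set. */
  #define KVM_ENTRYHI_ASID        cpu_asid_mask(&boot_cpu_data)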
digetx pushed a commit that referenced this pull request Jun 13, 2020 (same list_lru data-race commit message as above)
digetx pushed a commit that referenced this pull request Jun 17, 2020 (same list_lru data-race commit message as above)
digetx pushed a commit that referenced this pull request Jun 28, 2020
Suppose that, for unrelated reasons, FSF requests on behalf of recovery are
very slow and can run into the ERP timeout.

In the case at hand, we did adapter recovery to a large degree. However,
due to the slowness, a LUN open is pending, so the corresponding fc_rport
remains blocked. After fast_io_fail_tmo we trigger close physical port
recovery for the port under which the LUN should have been opened.  The new
higher order port recovery dismisses the pending LUN open ERP action and
dismisses the pending LUN open FSF request.  Such dismissal decouples the
ERP action from the pending corresponding FSF request by setting
zfcp_fsf_req->erp_action to NULL (among other things)
[zfcp_erp_strategy_check_fsfreq()].

If now the ERP timeout for the pending open LUN request runs out, we must
not use zfcp_fsf_req->erp_action in the ERP timeout handler.  This is a
problem since v4.15 commit 75492a5 ("s390/scsi: Convert timers to use
timer_setup()"). Before that we intentionally only passed zfcp_erp_action
as context argument to zfcp_erp_timeout_handler().

Note: The lifetime of the corresponding zfcp_fsf_req object continues until
a (late) response or an (unrelated) adapter recovery.

Just like the regular response path ignores dismissed requests
[zfcp_fsf_req_complete() => zfcp_fsf_protstatus_eval() => return early] the
ERP timeout handler now needs to ignore dismissed requests.  So simply
return early in the ERP timeout handler if the FSF request is marked as
dismissed in its status flags.  To protect against the race where
zfcp_erp_strategy_check_fsfreq() dismisses and sets
zfcp_fsf_req->erp_action to NULL after our previous status flag check,
return early if zfcp_fsf_req->erp_action is NULL.  After all, the former
ERP action does not need to be woken up as that was already done as part of
the dismissal above [zfcp_erp_action_dismiss()].

This fixes the following panic due to kernel page fault in IRQ context:

Unable to handle kernel pointer dereference in virtual kernel address space
Failing address: 0000000000000000 TEID: 0000000000000483
Fault in home space mode while using kernel ASCE.
AS:000009859238c00b R2:00000e3e7ffd000b R3:00000e3e7ffcc007 S:00000e3e7ffd7000 P:000000000000013d
Oops: 0004 ilc:2 [#1] SMP
Modules linked in: ...
CPU: 82 PID: 311273 Comm: stress Kdump: loaded Tainted: G            E  X   ...
Hardware name: IBM 8561 T01 701 (LPAR)
Krnl PSW : 0404c00180000000 001fffff80549be0 (zfcp_erp_notify+0x40/0xc0 [zfcp])
           R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
Krnl GPRS: 0000000000000080 00000e3d00000000 00000000000000f0 0000000000030000
           000000010028e700 000000000400a39c 000000010028e700 00000e3e7cf87e02
           0000000010000000 0700098591cb67f0 0000000000000000 0000000000000000
           0000033840e9a000 0000000000000000 001fffe008d6bc18 001fffe008d6bbc8
Krnl Code: 001fffff80549bd4: a7180000            lhi     %r1,0
           001fffff80549bd8: 4120a0f0            la      %r2,240(%r10)
          #001fffff80549bdc: a53e0003            llilh   %r3,3
          >001fffff80549be0: ba132000            cs      %r1,%r3,0(%r2)
           001fffff80549be4: a7740037            brc     7,1fffff80549c52
           001fffff80549be8: e320b0180004        lg      %r2,24(%r11)
           001fffff80549bee: e31020e00004        lg      %r1,224(%r2)
           001fffff80549bf4: 412020e0            la      %r2,224(%r2)
Call Trace:
 [<001fffff80549be0>] zfcp_erp_notify+0x40/0xc0 [zfcp]
 [<00000985915e26f0>] call_timer_fn+0x38/0x190
 [<00000985915e2944>] expire_timers+0xfc/0x190
 [<00000985915e2ac4>] run_timer_softirq+0xec/0x218
 [<0000098591ca7c4c>] __do_softirq+0x144/0x398
 [<00000985915110aa>] do_softirq_own_stack+0x72/0x88
 [<0000098591551b58>] irq_exit+0xb0/0xb8
 [<0000098591510c6a>] do_IRQ+0x82/0xb0
 [<0000098591ca7140>] ext_int_handler+0x128/0x12c
 [<0000098591722d98>] clear_subpage.constprop.13+0x38/0x60
([<000009859172ae4c>] clear_huge_page+0xec/0x250)
 [<000009859177e7a2>] do_huge_pmd_anonymous_page+0x32a/0x768
 [<000009859172a712>] __handle_mm_fault+0x88a/0x900
 [<000009859172a860>] handle_mm_fault+0xd8/0x1b0
 [<0000098591529ef6>] do_dat_exception+0x136/0x3e8
 [<0000098591ca6d34>] pgm_check_handler+0x1c8/0x220
Last Breaking-Event-Address:
 [<001fffff80549c88>] zfcp_erp_timeout_handler+0x10/0x18 [zfcp]
Kernel panic - not syncing: Fatal exception in interrupt

Link: https://lore.kernel.org/r/[email protected]
Fixes: 75492a5 ("s390/scsi: Convert timers to use timer_setup()")
Cc: <[email protected]> #4.15+
Reviewed-by: Julian Wiedmann <[email protected]>
Signed-off-by: Steffen Maier <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
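
In code, the early-return guard described above looks roughly like the
following sketch; the flag and helper names are taken from the commit text
and may not match the final diff exactly:

  static void zfcp_erp_timeout_handler(struct timer_list *t)
  {
          struct zfcp_fsf_req *fsf_req = from_timer(fsf_req, t, timer);
          struct zfcp_erp_action *act;

          /* Dismissed requests were decoupled from their ERP action and
           * the action was already woken up; nothing left to do here. */
          if (fsf_req->status & ZFCP_STATUS_FSFREQ_DISMISSED)
                  return;
          /* Guard against a dismissal racing with the check above. */
          act = READ_ONCE(fsf_req->erp_action);
          if (!act)
                  return;
          zfcp_erp_notify(act, ZFCP_STATUS_ERP_TIMEDOUT);
  }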
digetx pushed a commit that referenced this pull request Jun 28, 2020
A NULL pointer dereference happens occasionally on serial output initiated
by a login timeout. It was reproduced only when the kernel was built with
significant debugging options and the EDMA driver is used with a serial
console.

    col-vf50 login: root
    Password:
    Login timed out after 60 seconds.
    Unable to handle kernel NULL pointer dereference at virtual address 00000044
    Internal error: Oops: 5 [#1] ARM
    CPU: 0 PID: 157 Comm: login Not tainted 5.7.0-next-20200610-dirty #4
    Hardware name: Freescale Vybrid VF5xx/VF6xx (Device Tree)
      (fsl_edma_tx_handler) from [<8016eb10>] (__handle_irq_event_percpu+0x64/0x304)
      (__handle_irq_event_percpu) from [<8016eddc>] (handle_irq_event_percpu+0x2c/0x7c)
      (handle_irq_event_percpu) from [<8016ee64>] (handle_irq_event+0x38/0x5c)
      (handle_irq_event) from [<801729e4>] (handle_fasteoi_irq+0xa4/0x160)
      (handle_fasteoi_irq) from [<8016ddcc>] (generic_handle_irq+0x34/0x44)
      (generic_handle_irq) from [<8016e40c>] (__handle_domain_irq+0x54/0xa8)
      (__handle_domain_irq) from [<80508bc8>] (gic_handle_irq+0x4c/0x80)
      (gic_handle_irq) from [<80100af0>] (__irq_svc+0x70/0x98)
    Exception stack(0x8459fe80 to 0x8459fec8)
    fe80: 72286b00 e3359f64 00000001 0000412d a0070013 85c98840 85c98840 a0070013
    fea0: 8054e0d4 00000000 00000002 00000000 00000002 8459fed0 8081fbe8 8081fbec
    fec0: 60070013 ffffffff
      (__irq_svc) from [<8081fbec>] (_raw_spin_unlock_irqrestore+0x30/0x58)
      (_raw_spin_unlock_irqrestore) from [<8056cb48>] (uart_flush_buffer+0x88/0xf8)
      (uart_flush_buffer) from [<80554e60>] (tty_ldisc_hangup+0x38/0x1ac)
      (tty_ldisc_hangup) from [<8054c7f4>] (__tty_hangup+0x158/0x2bc)
      (__tty_hangup) from [<80557b90>] (disassociate_ctty.part.1+0x30/0x23c)
      (disassociate_ctty.part.1) from [<8011fc18>] (do_exit+0x580/0xba0)
      (do_exit) from [<801214f8>] (do_group_exit+0x3c/0xb4)
      (do_group_exit) from [<80121580>] (__wake_up_parent+0x0/0x14)

The issue looks like a race condition between the interrupt handler
fsl_edma_tx_handler() (called as a result of fsl_edma_xfer_desc()) and
terminating the transfer with fsl_edma_terminate_all():
fsl_edma_tx_handler() handles an interrupt for a transfer whose edesc has
already been freed and idle == true.

Fixes: d6be34f ("dma: Add Freescale eDMA engine driver support")
Signed-off-by: Krzysztof Kozlowski <[email protected]>
Reviewed-by: Robin Gong <[email protected]>
Cc: <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Vinod Koul <[email protected]>
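
A hedged sketch of the guard this implies inside the interrupt handler's
per-channel loop; the field names are assumed from the commit text:

  /* In fsl_edma_tx_handler(), for each channel that raised an interrupt:
   * take the channel lock and bail out if the transfer was already
   * terminated, so the freed edesc is never dereferenced. */
  spin_lock(&fsl_chan->vchan.lock);
  if (!fsl_chan->edesc) {
          /* terminated by fsl_edma_terminate_all(); descriptor is gone */
          spin_unlock(&fsl_chan->vchan.lock);
          continue;
  }
  /* ... normal completion handling of fsl_chan->edesc ... */
  spin_unlock(&fsl_chan->vchan.lock);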
digetx pushed a commit that referenced this pull request Jun 28, 2020 (same list_lru data-race commit message as above)
digetx pushed a commit that referenced this pull request Jul 1, 2020
devm_gpiod_get_index() doesn't return NULL but ERR_PTR(-ENOENT) when the
requested GPIO doesn't exist, leading to the following messages:

[    2.742468] gpiod_direction_input: invalid GPIO (errorpointer)
[    2.748147] can't set direction for gpio #2: -2
[    2.753081] gpiod_direction_input: invalid GPIO (errorpointer)
[    2.758724] can't set direction for gpio #3: -2
[    2.763666] gpiod_direction_output: invalid GPIO (errorpointer)
[    2.769394] can't set direction for gpio #4: -2
[    2.774341] gpiod_direction_input: invalid GPIO (errorpointer)
[    2.779981] can't set direction for gpio #5: -2
[    2.784545] ff000a20.serial: ttyCPM1 at MMIO 0xfff00a20 (irq = 39, base_baud = 8250000) is a CPM UART

Use devm_gpiod_get_index_optional() instead.

At the same time, handle the error case and properly exit
with an error.

Fixes: 97cbaf2 ("tty: serial: cpm_uart: Convert to use GPIO descriptors")
Cc: [email protected]
Cc: Linus Walleij <[email protected]>
Signed-off-by: Christophe Leroy <[email protected]>
Reviewed-by: Linus Walleij <[email protected]>
Link: https://lore.kernel.org/r/694a25fdce548c5ee8b060ef6a4b02746b8f25c0.1591986307.git.christophe.leroy@csgroup.eu
Signed-off-by: Greg Kroah-Hartman <[email protected]>
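
The resulting pattern, sketched under the assumption of a loop over the
GPIO indices as in the driver (the gpiod flags value is illustrative):

  struct gpio_desc *gpiod;

  /* The _optional variant returns NULL when the GPIO is simply not
   * described in the device tree, and an ERR_PTR() on real errors. */
  gpiod = devm_gpiod_get_index_optional(dev, NULL, i, GPIOD_ASIS);
  if (IS_ERR(gpiod))
          return PTR_ERR(gpiod);  /* real error: propagate it */
  if (!gpiod)
          continue;               /* no such GPIO: skip this index */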
digetx pushed a commit that referenced this pull request Jul 1, 2020
Luo bin says:

====================
hinic: add some ethtool ops support

patch #1: support to set and get pause params with
          "ethtool -A/a" cmd
patch #2: support to set and get irq coalesce params with
          "ethtool -C/c" cmd
patch #3: support to do self test with "ethtool -t" cmd
patch #4: support to identify physical device with "ethtool -p" cmd
patch #5: support to get eeprom information with "ethtool -m" cmd
====================

Signed-off-by: David S. Miller <[email protected]>
digetx pushed a commit that referenced this pull request Jul 1, 2020
Petr Machata says:

====================
TC: Introduce qevents

The Spectrum hardware allows execution of one of several actions as a
result of queue management decisions: tail-dropping, early-dropping,
marking a packet, or passing a configured latency threshold or buffer
size. Such packets can be mirrored, trapped, or sampled.

Modeling the action to be taken as simply a TC action is very attractive,
but it is not obvious where to put these actions. At least with ECN marking
one could imagine a tree of qdiscs and classifiers that effectively
accomplishes this task, albeit in an impractically complex manner. But
there is just no way to match on dropped-ness of a packet, let alone
dropped-ness due to a particular reason.

To allow configuring user-defined actions as a result of inner workings of
a qdisc, this patch set introduces a concept of qevents. Those are attach
points for TC blocks, where filters can be put that are executed as the
packet hits well-defined points in the qdisc algorithms. The attached
blocks can be shared, in a manner similar to clsact ingress and egress
blocks, arbitrary classifiers with arbitrary actions can be put on them,
etc.

For example:

	red limit 500K avpkt 1K qevent early_drop block 10
	matchall action mirred egress mirror dev eth1

The central patch #2 introduces several helpers to allow easy and uniform
addition of qevents to qdiscs: initialization, destruction, qevent block
number change validation, and qevent handling, i.e. dispatch of the filters
attached to the block bound to a qevent.

Patch #1 adds root_lock argument to qdisc enqueue op. The problem this is
tackling is that if a qevent filter pushes packets to the same qdisc tree
that holds the qevent in the first place, attempt to take qdisc root lock
for the second time will lead to a deadlock. To solve the issue, qevent
handler needs to unlock and relock the root lock around the filter
processing. Passing root_lock around makes it possible to get the lock
where it is needed, and visibly so, such that it is obvious the lock will
be used when invoking a qevent.

The following two patches, #3 and #4, then add two qevents to the RED
qdisc: "early_drop" qevent fires when a packet is early-dropped; "mark"
qevent, when it is ECN-marked.

Patch #5 contains a selftest. I have mentioned this test when pushing the
RED ECN nodrop mode and said that "I have no confidence in its portability
to [...] different configurations". That still holds. The backlog and
packet size are tuned to make the test deterministic. But it is better than
nothing, and on the boxes that I ran it on it does work and shows that
qevents work the way they are supposed to, and that their addition has not
broken the other tested features.

This patch set does not deal with offloading. The idea there is that a
driver will be able to figure out that a given block is used in qevent
context by looking at binder type. A future patch-set will add a qdisc
pointer to struct flow_block_offload, which a driver will be able to
consult to glean the TC or other relevant attributes.

Changes from RFC to v1:
- Move a "q = qdisc_priv(sch)" from patch #3 to patch #4
- Fix deadlock caused by mirroring packet back to the same qdisc tree.
- Rename "tail" qevent to "tail_drop".
- Adapt to the new 100-column standard.
- Add a selftest
====================

Signed-off-by: David S. Miller <[email protected]>
digetx pushed a commit that referenced this pull request Jul 1, 2020
When an MPTCP client tries to connect to itself, tcp_finish_connect() is
never reached. Because of this, depending on the socket's current state,
multiple faulty behaviours can be observed:

1) a WARN_ON() in subflow_data_ready() is hit
 WARNING: CPU: 2 PID: 882 at net/mptcp/subflow.c:911 subflow_data_ready+0x18b/0x230
 [...]
 CPU: 2 PID: 882 Comm: gh35 Not tainted 5.7.0+ #187
 [...]
 RIP: 0010:subflow_data_ready+0x18b/0x230
 [...]
 Call Trace:
  tcp_data_queue+0xd2f/0x4250
  tcp_rcv_state_process+0xb1c/0x49d3
  tcp_v4_do_rcv+0x2bc/0x790
  __release_sock+0x153/0x2d0
  release_sock+0x4f/0x170
  mptcp_shutdown+0x167/0x4e0
  __sys_shutdown+0xe6/0x180
  __x64_sys_shutdown+0x50/0x70
  do_syscall_64+0x9a/0x370
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

2) client is stuck forever in mptcp_sendmsg() because the socket is not
   TCP_ESTABLISHED

 crash> bt 4847
 PID: 4847   TASK: ffff88814b2fb100  CPU: 1   COMMAND: "gh35"
  #0 [ffff8881376ff680] __schedule at ffffffff97248da4
  #1 [ffff8881376ff778] schedule at ffffffff9724a34f
  #2 [ffff8881376ff7a0] schedule_timeout at ffffffff97252ba0
  #3 [ffff8881376ff8a8] wait_woken at ffffffff958ab4ba
  #4 [ffff8881376ff940] sk_stream_wait_connect at ffffffff96c2d859
  #5 [ffff8881376ffa28] mptcp_sendmsg at ffffffff97207fca
  #6 [ffff8881376ffbc0] sock_sendmsg at ffffffff96be1b5b
  #7 [ffff8881376ffbe8] sock_write_iter at ffffffff96be1daa
  #8 [ffff8881376ffce8] new_sync_write at ffffffff95e5cb52
  #9 [ffff8881376ffe50] vfs_write at ffffffff95e6547f
 #10 [ffff8881376ffe90] ksys_write at ffffffff95e65d26
 #11 [ffff8881376fff28] do_syscall_64 at ffffffff956088ba
 #12 [ffff8881376fff50] entry_SYSCALL_64_after_hwframe at ffffffff9740008c
     RIP: 00007f126f6956ed  RSP: 00007ffc2a320278  RFLAGS: 00000217
     RAX: ffffffffffffffda  RBX: 0000000020000044  RCX: 00007f126f6956ed
     RDX: 0000000000000004  RSI: 00000000004007b8  RDI: 0000000000000003
     RBP: 00007ffc2a3202a0   R8: 0000000000400720   R9: 0000000000400720
     R10: 0000000000400720  R11: 0000000000000217  R12: 00000000004004b0
     R13: 00007ffc2a320380  R14: 0000000000000000  R15: 0000000000000000
     ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b

3) tcpdump captures show that DSS is exchanged even when the MP_CAPABLE
   handshake didn't complete.

 $ tcpdump -tnnr bad.pcap
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [S], seq 3208913911, win 65483, options [mss 65495,sackOK,TS val 3291706876 ecr 3291694721,nop,wscale 7,mptcp capable v1], length 0
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [S.], seq 3208913911, ack 3208913912, win 65483, options [mss 65495,sackOK,TS val 3291706876 ecr 3291706876,nop,wscale 7,mptcp capable v1], length 0
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [.], ack 1, win 512, options [nop,nop,TS val 3291706876 ecr 3291706876], length 0
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [F.], seq 1, ack 1, win 512, options [nop,nop,TS val 3291707876 ecr 3291706876,mptcp dss fin seq 0 subseq 0 len 1,nop,nop], length 0
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [.], ack 2, win 512, options [nop,nop,TS val 3291707876 ecr 3291707876], length 0

Force a fallback to TCP in these cases, and adjust the main socket
state to avoid hanging in mptcp_sendmsg().

Closes: multipath-tcp/mptcp_net-next#35
Reported-by: Christoph Paasch <[email protected]>
Suggested-by: Paolo Abeni <[email protected]>
Signed-off-by: Davide Caratti <[email protected]>
Reviewed-by: Mat Martineau <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
digetx pushed a commit that referenced this pull request Jul 1, 2020
Ido Schimmel says:

====================
Add ethtool extended link state

Amit says:

Currently, device drivers can only indicate to user space if the network
link is up or down, without additional information.

This patch set provides an infrastructure that allows these drivers to
expose more information to user space about the link state. The
information can save users' time when trying to understand why a link is
not operationally up, for example.

The above is achieved by extending the existing ethtool LINKSTATE_GET
command with attributes that carry the extended state.

For example, no link due to missing cable:

$ ethtool ethX
...
Link detected: no (No cable)

Beside the general extended state, drivers can pass additional
information about the link state using the sub-state field. For example:

$ ethtool ethX
...
Link detected: no (Autoneg, No partner detected)

In the future the infrastructure can be extended - for example - to
allow PHY drivers to report whether a downshift to a lower speed
occurred. Something like:

$ ethtool ethX
...
Link detected: yes (downshifted)

Patch set overview:

Patches #1-#3 move mlxsw ethtool code to a separate file
Patches #4-#5 add the ethtool infrastructure for extended link state
Patches #6-#7 add support of extended link state in the mlxsw driver
Patches #8-#10 add test cases

Changes since v1:

* In documentation, show ETHTOOL_LINK_EXT_STATE_* and
  ETHTOOL_LINK_EXT_SUBSTATE_* constants instead of user-space strings
* Add `_CI_` to cable_issue substates to be consistent with
  other substates
* Keep the commit messages within 75 columns
* Use u8 variable for __link_ext_substate
* Document the meaning of -ENODATA in get_link_ext_state() callback
  description
* Do not zero data->link_ext_state_provided after getting an error
* Use `ret` variable for error value

Changes since RFC:

* Move documentation patch before ethtool patch
* Add nla_total_size() instead of sizeof() directly
* Return an error code from linkstate_get_ext_state()
* Remove SHORTED_CABLE, add CABLE_TEST_FAILURE instead
* Check if the interface is administratively up before setting ext_state
* Document all sub-states
====================

Signed-off-by: David S. Miller <[email protected]>
digetx pushed a commit that referenced this pull request Jul 1, 2020
This patch is to fix a crash:

 #3 [ffffb6580689f898] oops_end at ffffffffa2835bc2
 #4 [ffffb6580689f8b8] no_context at ffffffffa28766e7
 #5 [ffffb6580689f920] async_page_fault at ffffffffa320135e
    [exception RIP: f2fs_is_compressed_page+34]
    RIP: ffffffffa2ba83a2  RSP: ffffb6580689f9d8  RFLAGS: 00010213
    RAX: 0000000000000001  RBX: fffffc0f50b34bc0  RCX: 0000000000002122
    RDX: 0000000000002123  RSI: 0000000000000c00  RDI: fffffc0f50b34bc0
    RBP: ffff97e815a40178   R8: 0000000000000000   R9: ffff97e83ffc9000
    R10: 0000000000032300  R11: 0000000000032380  R12: ffffb6580689fa38
    R13: fffffc0f50b34bc0  R14: ffff97e825cbd000  R15: 0000000000000c00
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #6 [ffffb6580689f9d8] __is_cp_guaranteed at ffffffffa2b7ea98
 #7 [ffffb6580689f9f0] f2fs_submit_page_write at ffffffffa2b81a69
 #8 [ffffb6580689fa30] f2fs_do_write_meta_page at ffffffffa2b99777
 #9 [ffffb6580689fae0] __f2fs_write_meta_page at ffffffffa2b75f1a
 #10 [ffffb6580689fb18] f2fs_sync_meta_pages at ffffffffa2b77466
 #11 [ffffb6580689fc98] do_checkpoint at ffffffffa2b78e46
 #12 [ffffb6580689fd88] f2fs_write_checkpoint at ffffffffa2b79c29
 #13 [ffffb6580689fdd0] f2fs_sync_fs at ffffffffa2b69d95
 #14 [ffffb6580689fe20] sync_filesystem at ffffffffa2ad2574
 #15 [ffffb6580689fe30] generic_shutdown_super at ffffffffa2a9b582
 #16 [ffffb6580689fe48] kill_block_super at ffffffffa2a9b6d1
 #17 [ffffb6580689fe60] kill_f2fs_super at ffffffffa2b6abe1
 #18 [ffffb6580689fea0] deactivate_locked_super at ffffffffa2a9afb6
 #19 [ffffb6580689feb8] cleanup_mnt at ffffffffa2abcad4
 #20 [ffffb6580689fee0] task_work_run at ffffffffa28bca28
 #21 [ffffb6580689ff00] exit_to_usermode_loop at ffffffffa28050b7
 #22 [ffffb6580689ff38] do_syscall_64 at ffffffffa280560e
 #23 [ffffb6580689ff50] entry_SYSCALL_64_after_hwframe at ffffffffa320008c

This occurred when unmounting f2fs with F2FS_FS_COMPRESSION and
F2FS_IO_TRACE enabled. Fix it by adding IS_IO_TRACED_PAGE to check the
validity of the pid stored in page_private.

Signed-off-by: Yu Changchun <[email protected]>
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
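
A sketch of what such a validity check can look like; the exact bound is
an assumption, not necessarily the upstream macro:

  /* page_private() carries a traced pid only while IO tracing has tagged
   * the page; treat values outside the valid pid range as "not traced",
   * so stale or compression-related page_private data is never trusted. */
  #define IS_IO_TRACED_PAGE(page)                                 \
          (page_private(page) > 0 &&                              \
           page_private(page) < (unsigned long)PID_MAX_LIMIT)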
digetx pushed a commit that referenced this pull request Jul 1, 2020
Jakub Sitnicki says:

====================
This patch set prepares ground for link-based multi-prog attachment for
future netns attach types, with BPF_SK_LOOKUP attach type in mind [0].

Two changes are needed in order to attach and run a series of BPF programs:

  1) a bpf_prog_array of programs to run (patch #2), and
  2) a list of attached links to keep track of attachments (patch #3).

Nothing changes for BPF flow_dissector. Just as before only one program can
be attached to netns.

In v3 I've simplified patch #2 that introduces bpf_prog_array to take
advantage of the fact that it will hold at most one program for now.

In particular, I'm no longer using bpf_prog_array_copy. It turned out to be
less suitable for link operations than I thought as it fails to append the
same BPF program.

bpf_prog_array_replace_item is also gone, because we know we always want to
replace the first element in prog_array.

Naturally the code that handles bpf_prog_array will need change once
more when there is a program type that allows multi-prog attachment. But I
feel it will be better to do it gradually and present it together with
tests that actually exercise multi-prog code paths.

[0] https://lore.kernel.org/bpf/[email protected]/

v2 -> v3:
- Don't check if run_array is null in link update callback. (Martin)
- Allow updating the link with the same BPF program. (Andrii)
- Add patch #4 with a test for the above case.
- Kill bpf_prog_array_replace_item. Access the run_array directly.
- Switch from bpf_prog_array_copy() to bpf_prog_array_alloc(1, ...).
- Replace rcu_deref_protected & RCU_INIT_POINTER with rcu_replace_pointer.
- Drop Andrii's Ack from patch #2. Code changed.

v1 -> v2:

- Show with a (void) cast that bpf_prog_array_replace_item() return value
  is ignored on purpose. (Andrii)
- Explain why bpf-cgroup cannot replace programs in bpf_prog_array based
  on bpf_prog pointer comparison in patch #2 description. (Andrii)
====================

Signed-off-by: Alexei Starovoitov <[email protected]>
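
A hedged sketch of the simplified single-slot update the cover letter
describes; the location of the run_array pointer and the mutex protecting
updates are assumptions here:

  /* Replace the only program in a one-slot bpf_prog_array: allocate a
   * fresh array, install the program, and publish it with RCU. */
  static int run_array_set_prog(struct bpf_prog_array __rcu **run_array,
                                struct bpf_prog *prog)
  {
          struct bpf_prog_array *new_array, *old_array;

          new_array = bpf_prog_array_alloc(1, GFP_KERNEL);
          if (!new_array)
                  return -ENOMEM;
          new_array->items[0].prog = prog;

          /* Callers are assumed to hold the netns attach mutex. */
          old_array = rcu_replace_pointer(*run_array, new_array,
                                          lockdep_is_held(&netns_bpf_mutex));
          bpf_prog_array_free(old_array);
          return 0;
  }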
digetx pushed a commit that referenced this pull request Jul 1, 2020 (same list_lru data-race commit message as above)
digetx pushed a commit that referenced this pull request Jul 3, 2020 (same list_lru data-race commit message as above)
digetx pushed a commit that referenced this pull request Jul 8, 2020
Enable promiscuous mode on the PF, set the VF link state to enable, run
iperf on the VF, and then run the self test on the PF. The self test
fails with a low frequency and may cause a use-after-free problem.

[   87.142126] selftest:000004a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[   87.159722] ==================================================================
[   87.174187] BUG: KASAN: use-after-free in hex_dump_to_buffer+0x140/0x608
[   87.187600] Read of size 1 at addr ffff003b22828000 by task ethtool/1186
[   87.201012]
[   87.203978] CPU: 7 PID: 1186 Comm: ethtool Not tainted 5.5.0-rc4-gfd51c473-dirty #4
[   87.219306] Hardware name: Huawei TaiShan 2280 V2/BC82AMDA, BIOS TA BIOS 2280-A CS V2.B160.01 01/15/2020
[   87.238292] Call trace:
[   87.243173]  dump_backtrace+0x0/0x280
[   87.250491]  show_stack+0x24/0x30
[   87.257114]  dump_stack+0xe8/0x140
[   87.263911]  print_address_description.isra.8+0x70/0x380
[   87.274538]  __kasan_report+0x12c/0x230
[   87.282203]  kasan_report+0xc/0x18
[   87.288999]  __asan_load1+0x60/0x68
[   87.295969]  hex_dump_to_buffer+0x140/0x608
[   87.304332]  print_hex_dump+0x140/0x1e0
[   87.312000]  hns3_lb_check_skb_data+0x168/0x170
[   87.321060]  hns3_clean_rx_ring+0xa94/0xfe0
[   87.329422]  hns3_self_test+0x708/0x8c0

The packet sent by the selftest process is only 128 + 14 bytes long, the
minimum buffer size of a BD is 256 bytes, and the receive process makes
sure that the packet sent by the selftest process lands in the linear
part, so only the linear part needs to be checked in
hns3_lb_check_skb_data().

Fix this use-after-free by using skb_headlen() instead of skb->len as the
length when dumping skb->data.

Fixes: c39c4d9 ("net: hns3: Add mac loopback selftest support in hns3 driver")
Signed-off-by: Yonglong Liu <[email protected]>
Signed-off-by: Huazhong Tan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
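
Sketched, the dump call becomes the following; the arguments other than
the length are illustrative:

  /* Only the linear part of the skb is guaranteed to hold the selftest
   * payload, so bound the hex dump by skb_headlen(), not skb->len. */
  print_hex_dump(KERN_ERR, "selftest:", DUMP_PREFIX_OFFSET, 16, 1,
                 skb->data, skb_headlen(skb), true);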
digetx pushed a commit that referenced this pull request Jul 8, 2020 (same list_lru data-race commit message as above)
digetx pushed a commit that referenced this pull request Jul 8, 2020 (same list_lru data-race commit message as above)
digetx pushed a commit that referenced this pull request Jul 9, 2020 (same list_lru data-race commit message as above)
digetx pushed a commit that referenced this pull request Jul 22, 2020
https://bugzilla.kernel.org/show_bug.cgi?id=208565

PID: 257    TASK: ecdd0000  CPU: 0   COMMAND: "init"
  #0 [<c0b420ec>] (__schedule) from [<c0b423c8>]
  #1 [<c0b423c8>] (schedule) from [<c0b459d4>]
  #2 [<c0b459d4>] (rwsem_down_read_failed) from [<c0b44fa0>]
  #3 [<c0b44fa0>] (down_read) from [<c044233c>]
  #4 [<c044233c>] (f2fs_truncate_blocks) from [<c0442890>]
  #5 [<c0442890>] (f2fs_truncate) from [<c044d408>]
  #6 [<c044d408>] (f2fs_evict_inode) from [<c030be18>]
  #7 [<c030be18>] (evict) from [<c030a558>]
  #8 [<c030a558>] (iput) from [<c047c600>]
  #9 [<c047c600>] (f2fs_sync_node_pages) from [<c0465414>]
 #10 [<c0465414>] (f2fs_write_checkpoint) from [<c04575f4>]
 #11 [<c04575f4>] (f2fs_sync_fs) from [<c0441918>]
 #12 [<c0441918>] (f2fs_do_sync_file) from [<c0441098>]
 #13 [<c0441098>] (f2fs_sync_file) from [<c0323fa0>]
 #14 [<c0323fa0>] (vfs_fsync_range) from [<c0324294>]
 #15 [<c0324294>] (do_fsync) from [<c0324014>]
 #16 [<c0324014>] (sys_fsync) from [<c0108bc0>]

This can be caused by flush_dirty_inode() in f2fs_sync_node_pages(), where
iput() requires f2fs_lock_op() again, resulting in a livelock.

Reported-by: Zhiguo Niu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
digetx pushed a commit that referenced this pull request Jul 22, 2020 (same list_lru data-race commit message as above)
digetx pushed a commit that referenced this pull request Aug 8, 2020
Ido Schimmel says:

====================
mlxsw: Add support for buffer drop traps

Petr says:

A recent patch set added the ability to mirror buffer related drops
(e.g., early drops) through a netdev. This patch set adds the ability to
trap such packets to the local CPU for analysis.

The trapping towards the CPU is configured by using tc-trap action
instead of tc-mirred as was done when the packets were mirrored through
a netdev. A future patch set will also add the ability to sample the
dropped packets using tc-sample action.

The buffer related drop traps are added to devlink, which means that the
dropped packets can be reported to user space via the kernel's
drop_monitor module.

Patch set overview:

Patch #1 adds the early_drop trap to devlink

Patch #2 adds extack to a few devlink operations to facilitate better
error reporting to user space. This is necessary - among other things -
because the action of buffer drop traps cannot be changed in mlxsw

Patch #3 performs a small refactoring in mlxsw, patch #4 fixes a bug that
this patchset would trigger.

Patches #5-#6 add the infrastructure required to support different traps
/ trap groups in mlxsw per-ASIC. This is required because buffer drop
traps are not supported by Spectrum-1

Patch #7 extends mlxsw to register the early_drop trap

Patch #8 adds the offload logic for the "trap" action at a qevent block.

Patch #9 adds a mlxsw-specific selftest.
====================

Signed-off-by: David S. Miller <[email protected]>
digetx pushed a commit that referenced this pull request Oct 22, 2021
Attempting to defragment a Btrfs file containing a transparent huge page
immediately deadlocks with the following stack trace:

  #0  context_switch (kernel/sched/core.c:4940:2)
  #1  __schedule (kernel/sched/core.c:6287:8)
  #2  schedule (kernel/sched/core.c:6366:3)
  #3  io_schedule (kernel/sched/core.c:8389:2)
  #4  wait_on_page_bit_common (mm/filemap.c:1356:4)
  #5  __lock_page (mm/filemap.c:1648:2)
  #6  lock_page (./include/linux/pagemap.h:625:3)
  #7  pagecache_get_page (mm/filemap.c:1910:4)
  #8  find_or_create_page (./include/linux/pagemap.h:420:9)
  #9  defrag_prepare_one_page (fs/btrfs/ioctl.c:1068:9)
  #10 defrag_one_range (fs/btrfs/ioctl.c:1326:14)
  #11 defrag_one_cluster (fs/btrfs/ioctl.c:1421:9)
  #12 btrfs_defrag_file (fs/btrfs/ioctl.c:1523:9)
  #13 btrfs_ioctl_defrag (fs/btrfs/ioctl.c:3117:9)
  #14 btrfs_ioctl (fs/btrfs/ioctl.c:4872:10)
  #15 vfs_ioctl (fs/ioctl.c:51:10)
  #16 __do_sys_ioctl (fs/ioctl.c:874:11)
  #17 __se_sys_ioctl (fs/ioctl.c:860:1)
  #18 __x64_sys_ioctl (fs/ioctl.c:860:1)
  #19 do_syscall_x64 (arch/x86/entry/common.c:50:14)
  #20 do_syscall_64 (arch/x86/entry/common.c:80:7)
  #21 entry_SYSCALL_64+0x7c/0x15b (arch/x86/entry/entry_64.S:113)

A huge page is represented by a compound page, which consists of a
struct page for each PAGE_SIZE page within the huge page. The first
struct page is the "head page", and the remaining are "tail pages".

Defragmentation attempts to lock each page in the range. However,
lock_page() on a tail page actually locks the corresponding head page.
So, if defragmentation tries to lock more than one struct page in a
compound page, it tries to lock the same head page twice and deadlocks
with itself.

Ideally, we should be able to defragment transparent huge pages.
However, THP for filesystems is currently read-only, so a lot of code is
not ready to use huge pages for I/O. For now, let's just return
ETXTBUSY.

This can be reproduced with the following on a kernel with
CONFIG_READ_ONLY_THP_FOR_FS=y:

  $ cat create_thp_file.c
  #include <fcntl.h>
  #include <stdbool.h>
  #include <stdio.h>
  #include <stdint.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <sys/mman.h>

  static const char zeroes[1024 * 1024];
  static const size_t FILE_SIZE = 2 * 1024 * 1024;

  int main(int argc, char **argv)
  {
          if (argc != 2) {
                  fprintf(stderr, "usage: %s PATH\n", argv[0]);
                  return EXIT_FAILURE;
          }
          int fd = creat(argv[1], 0777);
          if (fd == -1) {
                  perror("creat");
                  return EXIT_FAILURE;
          }
          size_t written = 0;
          while (written < FILE_SIZE) {
                  ssize_t ret = write(fd, zeroes,
                                      sizeof(zeroes) < FILE_SIZE - written ?
                                      sizeof(zeroes) : FILE_SIZE - written);
                  if (ret < 0) {
                          perror("write");
                          return EXIT_FAILURE;
                  }
                  written += ret;
          }
          close(fd);
          fd = open(argv[1], O_RDONLY);
          if (fd == -1) {
                  perror("open");
                  return EXIT_FAILURE;
          }

          /*
           * Reserve some address space so that we can align the file mapping to
           * the huge page size.
           */
          void *placeholder_map = mmap(NULL, FILE_SIZE * 2, PROT_NONE,
                                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          if (placeholder_map == MAP_FAILED) {
                  perror("mmap (placeholder)");
                  return EXIT_FAILURE;
          }

          void *aligned_address =
                  (void *)(((uintptr_t)placeholder_map + FILE_SIZE - 1) & ~(FILE_SIZE - 1));

          void *map = mmap(aligned_address, FILE_SIZE, PROT_READ | PROT_EXEC,
                           MAP_SHARED | MAP_FIXED, fd, 0);
          if (map == MAP_FAILED) {
                  perror("mmap");
                  return EXIT_FAILURE;
          }
          if (madvise(map, FILE_SIZE, MADV_HUGEPAGE) < 0) {
                  perror("madvise");
                  return EXIT_FAILURE;
          }

          char *line = NULL;
          size_t line_capacity = 0;
          FILE *smaps_file = fopen("/proc/self/smaps", "r");
          if (!smaps_file) {
                  perror("fopen");
                  return EXIT_FAILURE;
          }
          for (;;) {
                  for (size_t off = 0; off < FILE_SIZE; off += 4096)
                          ((volatile char *)map)[off];

                  ssize_t ret;
                  bool this_mapping = false;
                  while ((ret = getline(&line, &line_capacity, smaps_file)) > 0) {
                          unsigned long start, end, huge;
                          if (sscanf(line, "%lx-%lx", &start, &end) == 2) {
                                  this_mapping = (start <= (uintptr_t)map &&
                                                  (uintptr_t)map < end);
                          } else if (this_mapping &&
                                     sscanf(line, "FilePmdMapped: %ld", &huge) == 1 &&
                                     huge > 0) {
                                  return EXIT_SUCCESS;
                          }
                  }

                  sleep(6);
                  rewind(smaps_file);
                  fflush(smaps_file);
          }
  }
  $ ./create_thp_file huge
  $ btrfs fi defrag -czstd ./huge

Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Omar Sandoval <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
digetx pushed a commit that referenced this pull request Oct 22, 2021
The perf_buffer test fails on systems with offline CPUs:

  # test_progs -t perf_buffer
  serial_test_perf_buffer:PASS:nr_cpus 0 nsec
  serial_test_perf_buffer:PASS:nr_on_cpus 0 nsec
  serial_test_perf_buffer:PASS:skel_load 0 nsec
  serial_test_perf_buffer:PASS:attach_kprobe 0 nsec
  serial_test_perf_buffer:PASS:perf_buf__new 0 nsec
  serial_test_perf_buffer:PASS:epoll_fd 0 nsec
  skipping offline CPU #4
  serial_test_perf_buffer:PASS:perf_buffer__poll 0 nsec
  serial_test_perf_buffer:PASS:seen_cpu_cnt 0 nsec
  serial_test_perf_buffer:PASS:buf_cnt 0 nsec
  ...
  serial_test_perf_buffer:PASS:fd_check 0 nsec
  serial_test_perf_buffer:PASS:drain_buf 0 nsec
  serial_test_perf_buffer:PASS:consume_buf 0 nsec
  serial_test_perf_buffer:FAIL:cpu_seen cpu 5 not seen
  #88 perf_buffer:FAIL
  Summary: 0/0 PASSED, 0 SKIPPED, 1 FAILED

If the offline CPU is in the middle of the possible set, we get a
mismatch between the possible and online CPU buffers.

The perf buffer test calls perf_buffer__consume_buffer for all
'possible' CPUs, but the library holds only 'online' CPU buffers, and
perf_buffer__consume_buffer returns them based on index.

Add an extra (online) index to keep track of online buffers; we still
need the original (possible) index to trigger the trace on the proper
CPU.
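
A minimal sketch of the resulting bookkeeping in the test (the 'online'
array and trigger_on_cpu() are illustrative names, not the exact patch):

  int on_idx = 0; /* index into the online-only buffer array */

  for (i = 0; i < nr_possible_cpus; i++) {
          if (!online[i])
                  continue; /* offline CPUs have no perf buffer */

          trigger_on_cpu(i); /* fire the kprobe on possible-CPU i */

          /* buffers are indexed by online position, not possible position */
          err = perf_buffer__consume_buffer(pb, on_idx++);
  }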

Signed-off-by: Jiri Olsa <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Acked-by: John Fastabend <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
digetx pushed a commit that referenced this pull request Oct 22, 2021
Patch series "Solve silent data loss caused by poisoned page cache (shmem/tmpfs)", v5.

When discussing the patch that splits page cache THP in order to offline
the poisoned page, Naoya mentioned there is a bigger problem [1] that
prevents this from working since the page cache page will be truncated if
uncorrectable errors happen.  Looking into this more deeply, it turns out
this approach (truncating the poisoned page) may incur silent data loss
for all non-readonly filesystems if the page is dirty.  It may be worse
for in-memory filesystems, e.g.  shmem/tmpfs, since the data blocks are
actually gone.

To solve this problem we could keep the poisoned dirty page in the page
cache and then notify users on any later access, e.g.  page fault,
read/write, etc.  Clean pages can be truncated as before, since they can
be reread from disk later on.

The consequence is that filesystems may find a poisoned page and
manipulate it as a healthy page, since filesystems don't actually check
whether the page is poisoned in any of the relevant paths except page
fault.  In general, we need to make filesystems aware of poisoned pages
before we can keep poisoned pages in the page cache in order to solve the
data loss problem.

To make filesystems aware of a poisoned page we should consider:
- The page should not be written back: clearing the dirty flag prevents
  writeback.
- The page should not be dropped (it shows as a clean page) by drop caches or
  other callers: the refcount pin from hwpoison prevents invalidation (called
  by cache drop, inode cache shrinking, etc), but it doesn't avoid invalidation
  in the DIO path.
- The page should still be able to get truncated/hole punched/unlinked: that
  works as is.
- Notify users when the page is accessed, e.g. read/write, page fault and other
  paths (compression, encryption, etc).

The scope of the last one is huge, since almost all filesystems need to do
it once a page is returned from a page cache lookup.  There are a couple
of options to do it:

1. Check the hwpoison flag in every path; the most straightforward way.
2. Return NULL for a poisoned page from page cache lookup; most callsites
   already check whether NULL is returned, so this should be the least work, I
   think.  But the error handling in filesystems just returns -ENOMEM, and that
   error code will obviously confuse users.
3. To improve #2, we could return an error pointer, e.g. ERR_PTR(-EIO), but
   this will involve a significant amount of code change as well, since all the
   paths need to check whether the pointer is an ERR_PTR, just like option #1.

I did prototypes for both #1 and #3, but it seems #3 may require more
changes than #1.  For #3 an ERR_PTR will be returned, so all the callers
need to check the return value, otherwise an invalid pointer may be
dereferenced.  But not all callers really care about the content of the
page; for example, a partial truncate just sets the truncated range in one
page to 0.  Such paths would need additional modification if ERR_PTR were
returned.  And if the callers have their own way to handle the problematic
pages, we would need to add a new FGP flag to tell the FGP functions to
return the pointer to the page anyway.

It may happen very rarely, but once it happens the consequence (data
corruption) could be very bad and it is very hard to debug.  It seems this
problem was briefly discussed before, but no action was taken at that
time.  [2]

As the aforementioned investigation shows, it would take a huge amount of
work to solve the potential data loss for all filesystems.  But it is much
easier for in-memory filesystems, and such filesystems actually suffer
more than others since even the data blocks are gone due to truncating.
So this patchset starts from shmem/tmpfs by taking option #1.
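
As a rough illustration of option #1, the check after a page cache lookup
could take the following shape (a sketch, not the exact shmem patch):

  page = find_lock_page(mapping, index);
  if (page && PageHWPoison(page)) {
          unlock_page(page);
          put_page(page);
          return -EIO; /* report the loss instead of faking clean data */
  }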

TODO:
* Unpoisoning has been broken since commit 0ed950d ("mm,hwpoison: make
  get_hwpoison_page() call get_any_page()"), and this patch series makes
  the refcount check for unpoisoning a shmem page fail.
* Expand to other filesystems.  But I haven't heard feedback from filesystem
  developers yet.

Patch breakdown:
Patch #1: cleanup, depended on by patch #2
Patch #2: fix THP with hwpoisoned subpage(s) PMD map bug
Patch #3: coding style cleanup
Patch #4: refactor and preparation
Patch #5: keep the poisoned page in page cache and handle such case for all
          the paths
Patch #6: the previous patches unblock page cache THP split, so this patch
          adds page cache THP split support

This patch (of 4):

A minor cleanup to the indent.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yang Shi <[email protected]>
Reviewed-by: Naoya Horiguchi <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Peter Xu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Stephen Rothwell <[email protected]>
digetx pushed a commit that referenced this pull request Oct 22, 2021
After removing /dev/kmem, sanitizing /proc/kcore and handling /dev/mem,
this series tackles the last sane way a VM could accidentally access
logically unplugged memory managed by a virtio-mem device: /proc/vmcore.

When dumping memory via "makedumpfile", PG_offline pages, used by
virtio-mem to flag logically unplugged memory, are already properly
excluded; however, especially when accessing/copying /proc/vmcore "the
usual way", we can still end up reading logically unplugged memory part of
a virtio-mem device.

Patches #1-#3 are cleanups.  Patch #4 extends the existing oldmem_pfn_is_ram
mechanism.  Patches #5-#7 are virtio-mem refactorings for patch #8, which
implements the virtio-mem logic to query the state of device blocks.
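
For orientation, the extended mechanism ends up with a registration
interface roughly like the following (a sketch of the vmcore_cb shape;
the exact names and signatures are defined by the later patches):

  struct vmcore_cb {
          bool (*pfn_is_ram)(struct vmcore_cb *cb, unsigned long pfn);
          struct list_head next;
  };

  void register_vmcore_cb(struct vmcore_cb *cb);
  void unregister_vmcore_cb(struct vmcore_cb *cb);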

Patch #8:

"
Although virtio-mem currently supports reading unplugged memory in the
hypervisor, this will change in the future, indicated to the device via
a new feature flag. We similarly sanitized /proc/kcore access recently.
[...]
Distributions that support virtio-mem+kdump have to make sure that the
virtio_mem module will be part of the kdump kernel or the kdump initrd;
dracut was recently [2] extended to include virtio-mem in the generated
initrd. As long as no special kdump kernels are used, this will
automatically make sure that virtio-mem will be around in the kdump initrd
and sanitize /proc/vmcore access -- with dracut.
"

This is the last remaining bit to support
VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE [3] in the Linux implementation of
virtio-mem.

Note: this is best-effort.  We'll never be able to control what runs
inside the second kernel, really, but we also don't have to care: we only
care about sane setups where we don't want our VM getting zapped once we
touch the wrong memory location while dumping.  While we usually expect
sane setups to use "makedumpfile", nothing really speaks against just
copying /proc/vmcore, especially in environments where HW poisoning isn't
typically expected.  Also, we really don't want to put all our trust
completely in the memmap, so sanitizing also makes sense when just using
"makedumpfile".

[1] https://lkml.kernel.org/r/[email protected]
[2] dracutdevs/dracut#1157
[3] https://lists.oasis-open.org/archives/virtio-comment/202109/msg00021.html

This patch (of 9):

The callback is only used for the vmcore nowadays.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Boris Ostrovsky <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: Stefano Stabellini <[email protected]>
Cc: "Michael S. Tsirkin" <[email protected]>
Cc: Jason Wang <[email protected]>
Cc: Dave Young <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Vivek Goyal <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Stephen Rothwell <[email protected]>
digetx pushed a commit that referenced this pull request Oct 25, 2021
Olga reports seeing the following Oops when doing O_DIRECT writes to a
pNFS flexfiles server:

Oops: 0000 [#1] SMP PTI
CPU: 1 PID: 234186 Comm: kworker/u8:1 Not tainted 5.15.0-rc4+ #4
Hardware name: Red Hat KVM/RHEL-AV, BIOS 1.13.0-2.module+el8.3.0+7353+9de0a3cc 04/01/2014
Workqueue: nfsiod rpc_async_release [sunrpc]
RIP: 0010:nfs_mark_request_commit+0x12/0x30 [nfs]
Code: ff ff be 03 00 00 00 e8 ac 34 83 eb e9 29 ff ff
ff e8 22 bc d7 eb 66 90 0f 1f 44 00 00 48 85 f6 74 16 48 8b 42 10 48
8b 40 18 <48> 8b 40 18 48 85 c0 74 05 e9 70 fc 15 ec 48 89 d6 e9 68 ed
ff ff
RSP: 0018:ffffa82f0159fe00 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff8f3393141880 RCX: 0000000000000000
RDX: ffffa82f0159fe08 RSI: ffff8f3381252500 RDI: ffff8f3393141880
RBP: ffff8f33ac317c00 R08: 0000000000000000 R09: ffff8f3487724cb0
R10: 0000000000000008 R11: 0000000000000001 R12: 0000000000000001
R13: ffff8f3485bccee0 R14: ffff8f33ac317c10 R15: ffff8f33ac317cd8
FS:  0000000000000000(0000) GS:ffff8f34fbc80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000018 CR3: 0000000122120006 CR4: 0000000000770ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
 nfs_direct_write_completion+0x13b/0x250 [nfs]
 rpc_free_task+0x39/0x60 [sunrpc]
 rpc_async_release+0x29/0x40 [sunrpc]
 process_one_work+0x1ce/0x370
 worker_thread+0x30/0x380
 ? process_one_work+0x370/0x370
 kthread+0x11a/0x140
 ? set_kthread_struct+0x40/0x40
 ret_from_fork+0x22/0x30

Reported-by: Olga Kornievskaia <[email protected]>
Fixes: 9c455a8 ("NFS/pNFS: Clean up pNFS commit operations")
Signed-off-by: Trond Myklebust <[email protected]>
digetx pushed a commit that referenced this pull request Oct 27, 2021
Ido Schimmel says:

====================
mlxsw: Support multiple RIF MAC prefixes

Currently, mlxsw enforces that all the netdevs used as router interfaces
(RIFs) have the same MAC prefix (e.g., same 38 MSBs in Spectrum-1).
Otherwise, an error is returned to user space with extack. This patchset
relaxes the limitation through the use of RIF MAC profiles.

A RIF MAC profile is a hardware entity that represents a particular MAC
prefix which multiple RIFs can reference. Therefore, the number of
possible MAC prefixes is no longer one, but the number of profiles
supported by the device.

The ability to change the MAC of a particular netdev is useful, for
example, for users who use the netdev to connect to an upstream provider
that performs MAC filtering. Currently, such users are either forced to
negotiate with the provider or change the MAC address of all other
netdevs so that they share the same prefix.
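
For example, with RIF MAC profiles something like the following can
succeed even when other router interfaces use a different MAC prefix
(the netdev name and address are placeholders):

  # ip link set dev swp1 address 00:11:22:33:44:55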

Patchset overview:

Patches #1-#3 are preparations.

Patch #4 adds actual support for RIF MAC profiles.

Patch #5 exposes RIF MAC profiles as a devlink resource, so that user
space has visibility into the maximum number of profiles and current
occupancy. Useful for debugging and testing (next 3 patches).

Patches #6-#8 add both scale and functional tests.

Patch #9 removes tests that validated the previous limitation. It is now
covered by patch #6 for devices that support a single profile.
====================

Signed-off-by: David S. Miller <[email protected]>
digetx pushed a commit that referenced this pull request Oct 27, 2021
We got the following lockdep splat while running fstests (specifically
btrfs/003 and btrfs/020 in a row) with the new rc.  This was uncovered
by 87579e9 ("loop: use worker per cgroup instead of kworker") which
converted loop to using workqueues, which comes with lockdep
annotations that don't exist with kworkers.  The lockdep splat is as
follows:

  WARNING: possible circular locking dependency detected
  5.14.0-rc2-custom+ #34 Not tainted
  ------------------------------------------------------
  losetup/156417 is trying to acquire lock:
  ffff9c7645b02d38 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x84/0x600

  but task is already holding lock:
  ffff9c7647395468 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x650 [loop]

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #5 (&lo->lo_mutex){+.+.}-{3:3}:
	 __mutex_lock+0xba/0x7c0
	 lo_open+0x28/0x60 [loop]
	 blkdev_get_whole+0x28/0xf0
	 blkdev_get_by_dev.part.0+0x168/0x3c0
	 blkdev_open+0xd2/0xe0
	 do_dentry_open+0x163/0x3a0
	 path_openat+0x74d/0xa40
	 do_filp_open+0x9c/0x140
	 do_sys_openat2+0xb1/0x170
	 __x64_sys_openat+0x54/0x90
	 do_syscall_64+0x3b/0x90
	 entry_SYSCALL_64_after_hwframe+0x44/0xae

  -> #4 (&disk->open_mutex){+.+.}-{3:3}:
	 __mutex_lock+0xba/0x7c0
	 blkdev_get_by_dev.part.0+0xd1/0x3c0
	 blkdev_get_by_path+0xc0/0xd0
	 btrfs_scan_one_device+0x52/0x1f0 [btrfs]
	 btrfs_control_ioctl+0xac/0x170 [btrfs]
	 __x64_sys_ioctl+0x83/0xb0
	 do_syscall_64+0x3b/0x90
	 entry_SYSCALL_64_after_hwframe+0x44/0xae

  -> #3 (uuid_mutex){+.+.}-{3:3}:
	 __mutex_lock+0xba/0x7c0
	 btrfs_rm_device+0x48/0x6a0 [btrfs]
	 btrfs_ioctl+0x2d1c/0x3110 [btrfs]
	 __x64_sys_ioctl+0x83/0xb0
	 do_syscall_64+0x3b/0x90
	 entry_SYSCALL_64_after_hwframe+0x44/0xae

  -> #2 (sb_writers#11){.+.+}-{0:0}:
	 lo_write_bvec+0x112/0x290 [loop]
	 loop_process_work+0x25f/0xcb0 [loop]
	 process_one_work+0x28f/0x5d0
	 worker_thread+0x55/0x3c0
	 kthread+0x140/0x170
	 ret_from_fork+0x22/0x30

  -> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}:
	 process_one_work+0x266/0x5d0
	 worker_thread+0x55/0x3c0
	 kthread+0x140/0x170
	 ret_from_fork+0x22/0x30

  -> #0 ((wq_completion)loop0){+.+.}-{0:0}:
	 __lock_acquire+0x1130/0x1dc0
	 lock_acquire+0xf5/0x320
	 flush_workqueue+0xae/0x600
	 drain_workqueue+0xa0/0x110
	 destroy_workqueue+0x36/0x250
	 __loop_clr_fd+0x9a/0x650 [loop]
	 lo_ioctl+0x29d/0x780 [loop]
	 block_ioctl+0x3f/0x50
	 __x64_sys_ioctl+0x83/0xb0
	 do_syscall_64+0x3b/0x90
	 entry_SYSCALL_64_after_hwframe+0x44/0xae

  other info that might help us debug this:
  Chain exists of:
    (wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex
   Possible unsafe locking scenario:
	 CPU0                    CPU1
	 ----                    ----
    lock(&lo->lo_mutex);
				 lock(&disk->open_mutex);
				 lock(&lo->lo_mutex);
    lock((wq_completion)loop0);

   *** DEADLOCK ***
  1 lock held by losetup/156417:
   #0: ffff9c7647395468 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x650 [loop]

  stack backtrace:
  CPU: 8 PID: 156417 Comm: losetup Not tainted 5.14.0-rc2-custom+ #34
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  Call Trace:
   dump_stack_lvl+0x57/0x72
   check_noncircular+0x10a/0x120
   __lock_acquire+0x1130/0x1dc0
   lock_acquire+0xf5/0x320
   ? flush_workqueue+0x84/0x600
   flush_workqueue+0xae/0x600
   ? flush_workqueue+0x84/0x600
   drain_workqueue+0xa0/0x110
   destroy_workqueue+0x36/0x250
   __loop_clr_fd+0x9a/0x650 [loop]
   lo_ioctl+0x29d/0x780 [loop]
   ? __lock_acquire+0x3a0/0x1dc0
   ? update_dl_rq_load_avg+0x152/0x360
   ? lock_is_held_type+0xa5/0x120
   ? find_held_lock.constprop.0+0x2b/0x80
   block_ioctl+0x3f/0x50
   __x64_sys_ioctl+0x83/0xb0
   do_syscall_64+0x3b/0x90
   entry_SYSCALL_64_after_hwframe+0x44/0xae
  RIP: 0033:0x7f645884de6b

Usually the uuid_mutex exists to protect the fs_devices that map
together all of the devices that match a specific uuid.  In rm_device
we're messing with the uuid of a device, so it makes sense to protect
that here.

However in doing that it pulls in a whole host of lockdep dependencies,
as we call mnt_may_write() on the sb before we grab the uuid_mutex, thus
we end up with the dependency chain under the uuid_mutex being added
under the normal sb write dependency chain, which causes problems with
loop devices.

We don't need the uuid mutex here however.  If we call
btrfs_scan_one_device() before we scratch the super block we will find
the fs_devices and not find the device itself and return EBUSY because
the fs_devices is open.  If we call it after the scratch happens it will
not appear to be a valid btrfs file system.

We do not need to worry about other fs_devices modifying operations here
because we're protected by the exclusive operations locking.

So drop the uuid_mutex here in order to fix the lockdep splat.

A more detailed explanation from the discussion:

We are worried about rm and scan racing with each other, before this
change we'll zero the device out under the UUID mutex so when scan does
run it'll make sure that it can go through the whole device scan thing
without rm messing with us.

We aren't worried if the scratch happens first, because the result is we
don't think this is a btrfs device and we bail out.

The only case we are concerned with is we scratch _after_ scan is able
to read the superblock and gets a seemingly valid super block, so lets
consider this case.

Scan will call device_list_add() with the device we're removing.  We'll
call find_fsid_with_metadata_uuid() and get our fs_devices for this
UUID.  At this point we lock the fs_devices->device_list_mutex.  This is
what protects us in this case, but we have two cases here.

1. We aren't to the device removal part of the RM.  We found our device,
   and device name matches our path, we go down and we set total_devices
   to our super number of devices, which doesn't affect anything because
   we haven't done the remove yet.

2. We are past the device removal part, which is protected by the
   device_list_mutex.  Scan doesn't find the device, it goes down and
   does the

   if (fs_devices->opened)
	   return -EBUSY;

   check and we bail out.

Nothing about this situation is ideal, but the lockdep splat is real, and
the fix is safe, though admittedly a bit scary looking.

Reviewed-by: Anand Jain <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
[ copy more from the discussion ]
Signed-off-by: David Sterba <[email protected]>
digetx pushed a commit that referenced this pull request Oct 27, 2021
For device removal and replace we call btrfs_find_device_by_devspec,
which if we give it a device path and nothing else will call
btrfs_get_dev_args_from_path, which opens the block device and reads the
super block and then looks up our device based on that.

However at this point we're holding the sb write "lock", so reading the
block device pulls in the dependency of ->open_mutex, which produces the
following lockdep splat

======================================================
WARNING: possible circular locking dependency detected
5.14.0-rc2+ #405 Not tainted
------------------------------------------------------
losetup/11576 is trying to acquire lock:
ffff9bbe8cded938 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x67/0x5e0

but task is already holding lock:
ffff9bbe88e4fc68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #4 (&lo->lo_mutex){+.+.}-{3:3}:
       __mutex_lock+0x7d/0x750
       lo_open+0x28/0x60 [loop]
       blkdev_get_whole+0x25/0xf0
       blkdev_get_by_dev.part.0+0x168/0x3c0
       blkdev_open+0xd2/0xe0
       do_dentry_open+0x161/0x390
       path_openat+0x3cc/0xa20
       do_filp_open+0x96/0x120
       do_sys_openat2+0x7b/0x130
       __x64_sys_openat+0x46/0x70
       do_syscall_64+0x38/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xae

-> #3 (&disk->open_mutex){+.+.}-{3:3}:
       __mutex_lock+0x7d/0x750
       blkdev_get_by_dev.part.0+0x56/0x3c0
       blkdev_get_by_path+0x98/0xa0
       btrfs_get_bdev_and_sb+0x1b/0xb0
       btrfs_find_device_by_devspec+0x12b/0x1c0
       btrfs_rm_device+0x127/0x610
       btrfs_ioctl+0x2a31/0x2e70
       __x64_sys_ioctl+0x80/0xb0
       do_syscall_64+0x38/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xae

-> #2 (sb_writers#12){.+.+}-{0:0}:
       lo_write_bvec+0xc2/0x240 [loop]
       loop_process_work+0x238/0xd00 [loop]
       process_one_work+0x26b/0x560
       worker_thread+0x55/0x3c0
       kthread+0x140/0x160
       ret_from_fork+0x1f/0x30

-> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}:
       process_one_work+0x245/0x560
       worker_thread+0x55/0x3c0
       kthread+0x140/0x160
       ret_from_fork+0x1f/0x30

-> #0 ((wq_completion)loop0){+.+.}-{0:0}:
       __lock_acquire+0x10ea/0x1d90
       lock_acquire+0xb5/0x2b0
       flush_workqueue+0x91/0x5e0
       drain_workqueue+0xa0/0x110
       destroy_workqueue+0x36/0x250
       __loop_clr_fd+0x9a/0x660 [loop]
       block_ioctl+0x3f/0x50
       __x64_sys_ioctl+0x80/0xb0
       do_syscall_64+0x38/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xae

other info that might help us debug this:

Chain exists of:
  (wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&lo->lo_mutex);
                               lock(&disk->open_mutex);
                               lock(&lo->lo_mutex);
  lock((wq_completion)loop0);

 *** DEADLOCK ***

1 lock held by losetup/11576:
 #0: ffff9bbe88e4fc68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]

stack backtrace:
CPU: 0 PID: 11576 Comm: losetup Not tainted 5.14.0-rc2+ #405
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Call Trace:
 dump_stack_lvl+0x57/0x72
 check_noncircular+0xcf/0xf0
 ? stack_trace_save+0x3b/0x50
 __lock_acquire+0x10ea/0x1d90
 lock_acquire+0xb5/0x2b0
 ? flush_workqueue+0x67/0x5e0
 ? lockdep_init_map_type+0x47/0x220
 flush_workqueue+0x91/0x5e0
 ? flush_workqueue+0x67/0x5e0
 ? verify_cpu+0xf0/0x100
 drain_workqueue+0xa0/0x110
 destroy_workqueue+0x36/0x250
 __loop_clr_fd+0x9a/0x660 [loop]
 ? blkdev_ioctl+0x8d/0x2a0
 block_ioctl+0x3f/0x50
 __x64_sys_ioctl+0x80/0xb0
 do_syscall_64+0x38/0x90
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f31b02404cb

Instead what we want to do is populate our device lookup args before we
grab any locks, and then pass these args into btrfs_rm_device().  From
there we can find the device and do the appropriate removal.
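
A hedged sketch of that ordering (simplified; the exact signatures belong
to this series and may differ in detail):

  BTRFS_DEV_LOOKUP_ARGS(args);

  /*
   * Resolve the path while holding no btrfs locks: this is the step that
   * may open the block device and read its super block.
   */
  ret = btrfs_get_dev_args_from_path(fs_info, &args, path);
  if (ret)
          return ret;

  /* Now take the usual locks and remove by the pre-resolved identity. */
  ret = btrfs_rm_device(fs_info, &args, &bdev, &mode);
  btrfs_put_dev_args_from_path(&args);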

Suggested-by: Anand Jain <[email protected]>
Reviewed-by: Anand Jain <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Signed-off-by: David Sterba <[email protected]>
digetx pushed a commit that referenced this pull request Nov 1, 2021
Move cifs/smb to using the alternate fallback fscache I/O API instead of
the old upstream I/O API as that is about to be deleted.  The alternate API
will also be deleted at some point in the future as it's dangerous (as is
the old API) and can lead to data corruption if the backing filesystem can
insert/remove bridging blocks of zeros into its extent list[1].

The alternate API reads and writes pages synchronously, with the intention
of allowing removal of the operation management framework and thence the
object management framework from fscache.

The preferred change would be to use the netfs lib, but the new I/O API can
be used directly.  It's just that as the cache now needs to track data for
itself, caching blocks may exceed page size...

Changes
=======
ver #4:
  - cifs_readpage_to_fscache() shouldn't test the PG_fscache bit on a page
    to determine if that page should be written to disk.  That bit is no
    longer used like that.

ver #2:
  - Changed "deprecated" to "fallback" in the new function names[2].

Signed-off-by: David Howells <[email protected]>
cc: Steve French <[email protected]>
cc: Shyam Prasad N <[email protected]>
cc: [email protected]
cc: [email protected]
Link: https://lore.kernel.org/r/[email protected] [1]
Link: https://lore.kernel.org/r/CAHk-=wiVK+1CyEjW8u71zVPK8msea=qPpznX35gnX+s8sXnJTg@mail.gmail.com/ [2]
Link: https://lore.kernel.org/r/163162773867.438332.3585429891151112562.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/163189112708.2509237.17528578040344723638.stgit@warthog.procyon.org.uk/ # rfc v2
digetx pushed a commit that referenced this pull request Nov 5, 2021
Convert the 9p filesystem to use the netfs helper lib to handle readpage,
readahead and write_begin, converting those into a common issue_op for the
filesystem itself to handle.  The netfs helper lib also handles reading
from fscache if a cache is available, and interleaving reads from both
sources.
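
A hedged sketch of the wiring (the read-request ops table as the netfs
lib defines it at this point; simplified):

  static const struct netfs_read_request_ops v9fs_req_ops = {
          .init_rreq              = v9fs_init_rreq,
          .is_cache_enabled       = v9fs_is_cache_enabled,
          .begin_cache_operation  = v9fs_begin_cache_operation,
          .issue_op               = v9fs_req_issue_op,
          .cleanup                = v9fs_req_cleanup,
  };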

This change also switches from the old fscache I/O API to the new one,
meaning that fscache no longer keeps track of netfs pages and instead does
async DIO between the backing files and the 9p file pagecache.  As a part
of this change, the handling of PG_fscache changes.  It now just means that
the cache has a write I/O operation in progress on a page (PG_locked
is used for a read I/O op).

Note that this is a cut-down version of the fscache rewrite and does not
change any of the cookie and cache coherency handling.

Changes
=======
ver #4:
  - Rebase on top of folios.
  - Don't use wait_on_page_bit_killable().

ver #3:
  - v9fs_req_issue_op() needs to terminate the subrequest.
  - v9fs_write_end() needs to call SetPageUptodate() a bit more often.
  - It's not CONFIG_{AFS,V9FS}_FSCACHE[1]
  - v9fs_init_rreq() should take a ref on the p9_fid and the cleanup should
    drop it [from Dominique Martinet].

Signed-off-by: David Howells <[email protected]>
Reviewed-and-tested-by: Dominique Martinet <[email protected]>
cc: [email protected]
cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]/ [1]
Link: https://lore.kernel.org/r/163162772646.438332.16323773205855053535.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/163189109885.2509237.7153668924503399173.stgit@warthog.procyon.org.uk/ # rfc v2
Link: https://lore.kernel.org/r/163363943896.1980952.1226527304649419689.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/163551662876.1877519.14706391695553204156.stgit@warthog.procyon.org.uk/ # v4
Link: https://lore.kernel.org/r/163584179557.4023316.11089762304657644342.stgit@warthog.procyon.org.uk # rebase on folio
Signed-off-by: Dominique Martinet <[email protected]>
digetx pushed a commit that referenced this pull request Nov 5, 2021
The host crashes when pci_enable_atomic_ops_to_root() is called for VFs
on virtual buses. The virtual buses added for SR-IOV have bus->self set
to NULL, and the host crashes due to this.

  PID: 4481   TASK: ffff89c6941b0000  CPU: 53  COMMAND: "bash"
  ...
   #3 [ffff9a9481713808] oops_end at ffffffffb9025cd6
   #4 [ffff9a9481713828] page_fault_oops at ffffffffb906e417
   #5 [ffff9a9481713888] exc_page_fault at ffffffffb9a0ad14
   #6 [ffff9a94817138b0] asm_exc_page_fault at ffffffffb9c00ace
      [exception RIP: pcie_capability_read_dword+28]
      RIP: ffffffffb952fd5c  RSP: ffff9a9481713960  RFLAGS: 00010246
      RAX: 0000000000000001  RBX: ffff89c6b1096000  RCX: 0000000000000000
      RDX: ffff9a9481713990  RSI: 0000000000000024  RDI: 0000000000000000
      RBP: 0000000000000080   R8: 0000000000000008   R9: ffff89c64341a2f8
      R10: 0000000000000002  R11: 0000000000000000  R12: ffff89c648bab000
      R13: 0000000000000000  R14: 0000000000000000  R15: ffff89c648bab0c8
      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
   #7 [ffff9a9481713988] pci_enable_atomic_ops_to_root at ffffffffb95359a6
   #8 [ffff9a94817139c0] bnxt_qplib_determine_atomics at ffffffffc08c1a33 [bnxt_re]
   #9 [ffff9a94817139d0] bnxt_re_dev_init at ffffffffc08ba2d1 [bnxt_re]

Per PCIe r5.0, sec 9.3.5.10, the AtomicOp Requester Enable bit in Device
Control 2 is reserved for VFs.  The PF value applies to all associated VFs.

Return -EINVAL if pci_enable_atomic_ops_to_root() is called for a VF.
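
A minimal sketch of such a guard (illustrative only; the exact placement
inside pci_enable_atomic_ops_to_root() may differ):

    int pci_enable_atomic_ops_to_root(struct pci_dev *dev, u32 cap_mask)
    {
            /*
             * Per PCIe r5.0, sec 9.3.5.10, AtomicOp Requester Enable is
             * reserved for VFs; the PF setting applies to all its VFs.
             */
            if (dev->is_virtfn)
                    return -EINVAL;

            /* ... existing capability walk up to the Root Port ... */
            return 0;
    }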

Link: https://lore.kernel.org/r/[email protected]
Fixes: 35f5ace ("RDMA/bnxt_re: Enable global atomic ops if platform supports")
Fixes: 430a236 ("PCI: Add pci_enable_atomic_ops_to_root()")
Signed-off-by: Selvin Xavier <[email protected]>
Signed-off-by: Bjorn Helgaas <[email protected]>
Reviewed-by: Andy Gospodarek <[email protected]>
digetx pushed a commit that referenced this pull request Nov 9, 2021
It is generally unsafe to call put_device() with dpm_list_mtx held,
because the given device's release routine may carry out an action
depending on that lock, which may then deadlock.  Modify the
system-wide suspend and resume of devices to always drop dpm_list_mtx
before calling put_device() (and adjust whitespace somewhat while
at it).
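
The resulting pattern is roughly the following (a minimal sketch of the
idea, not the exact diff):

    /*
     * Drop the list lock around the final reference drop so that a
     * release routine which needs dpm_list_mtx cannot deadlock on it.
     */
    mutex_unlock(&dpm_list_mtx);
    put_device(dev);
    mutex_lock(&dpm_list_mtx);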

For instance, this prevents the following splat from showing up in
the kernel log after a system resume in certain configurations:

[ 3290.969514] ======================================================
[ 3290.969517] WARNING: possible circular locking dependency detected
[ 3290.969519] 5.15.0+ #2420 Tainted: G S
[ 3290.969523] ------------------------------------------------------
[ 3290.969525] systemd-sleep/4553 is trying to acquire lock:
[ 3290.969529] ffff888117ab1138 ((wq_completion)hci0#2){+.+.}-{0:0}, at: flush_workqueue+0x87/0x4a0
[ 3290.969554]
               but task is already holding lock:
[ 3290.969556] ffffffff8280fca8 (dpm_list_mtx){+.+.}-{3:3}, at: dpm_resume+0x12e/0x3e0
[ 3290.969571]
               which lock already depends on the new lock.

[ 3290.969573]
               the existing dependency chain (in reverse order) is:
[ 3290.969575]
               -> #3 (dpm_list_mtx){+.+.}-{3:3}:
[ 3290.969583]        __mutex_lock+0x9d/0xa30
[ 3290.969591]        device_pm_add+0x2e/0xe0
[ 3290.969597]        device_add+0x4d5/0x8f0
[ 3290.969605]        hci_conn_add_sysfs+0x43/0xb0 [bluetooth]
[ 3290.969689]        hci_conn_complete_evt.isra.71+0x124/0x750 [bluetooth]
[ 3290.969747]        hci_event_packet+0xd6c/0x28a0 [bluetooth]
[ 3290.969798]        hci_rx_work+0x213/0x640 [bluetooth]
[ 3290.969842]        process_one_work+0x2aa/0x650
[ 3290.969851]        worker_thread+0x39/0x400
[ 3290.969859]        kthread+0x142/0x170
[ 3290.969865]        ret_from_fork+0x22/0x30
[ 3290.969872]
               -> #2 (&hdev->lock){+.+.}-{3:3}:
[ 3290.969881]        __mutex_lock+0x9d/0xa30
[ 3290.969887]        hci_event_packet+0xba/0x28a0 [bluetooth]
[ 3290.969935]        hci_rx_work+0x213/0x640 [bluetooth]
[ 3290.969978]        process_one_work+0x2aa/0x650
[ 3290.969985]        worker_thread+0x39/0x400
[ 3290.969993]        kthread+0x142/0x170
[ 3290.969999]        ret_from_fork+0x22/0x30
[ 3290.970004]
               -> #1 ((work_completion)(&hdev->rx_work)){+.+.}-{0:0}:
[ 3290.970013]        process_one_work+0x27d/0x650
[ 3290.970020]        worker_thread+0x39/0x400
[ 3290.970028]        kthread+0x142/0x170
[ 3290.970033]        ret_from_fork+0x22/0x30
[ 3290.970038]
               -> #0 ((wq_completion)hci0#2){+.+.}-{0:0}:
[ 3290.970047]        __lock_acquire+0x15cb/0x1b50
[ 3290.970054]        lock_acquire+0x26c/0x300
[ 3290.970059]        flush_workqueue+0xae/0x4a0
[ 3290.970066]        drain_workqueue+0xa1/0x130
[ 3290.970073]        destroy_workqueue+0x34/0x1f0
[ 3290.970081]        hci_release_dev+0x49/0x180 [bluetooth]
[ 3290.970130]        bt_host_release+0x1d/0x30 [bluetooth]
[ 3290.970195]        device_release+0x33/0x90
[ 3290.970201]        kobject_release+0x63/0x160
[ 3290.970211]        dpm_resume+0x164/0x3e0
[ 3290.970215]        dpm_resume_end+0xd/0x20
[ 3290.970220]        suspend_devices_and_enter+0x1a4/0xba0
[ 3290.970229]        pm_suspend+0x26b/0x310
[ 3290.970236]        state_store+0x42/0x90
[ 3290.970243]        kernfs_fop_write_iter+0x135/0x1b0
[ 3290.970251]        new_sync_write+0x125/0x1c0
[ 3290.970257]        vfs_write+0x360/0x3c0
[ 3290.970263]        ksys_write+0xa7/0xe0
[ 3290.970269]        do_syscall_64+0x3a/0x80
[ 3290.970276]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 3290.970284]
               other info that might help us debug this:

[ 3290.970285] Chain exists of:
                 (wq_completion)hci0#2 --> &hdev->lock --> dpm_list_mtx

[ 3290.970297]  Possible unsafe locking scenario:

[ 3290.970299]        CPU0                    CPU1
[ 3290.970300]        ----                    ----
[ 3290.970302]   lock(dpm_list_mtx);
[ 3290.970306]                                lock(&hdev->lock);
[ 3290.970310]                                lock(dpm_list_mtx);
[ 3290.970314]   lock((wq_completion)hci0#2);
[ 3290.970319]
                *** DEADLOCK ***

[ 3290.970321] 7 locks held by systemd-sleep/4553:
[ 3290.970325]  #0: ffff888103bcd448 (sb_writers#4){.+.+}-{0:0}, at: ksys_write+0xa7/0xe0
[ 3290.970341]  #1: ffff888115a14488 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x103/0x1b0
[ 3290.970355]  #2: ffff888100f719e0 (kn->active#233){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x10c/0x1b0
[ 3290.970369]  #3: ffffffff82661048 (autosleep_lock){+.+.}-{3:3}, at: state_store+0x12/0x90
[ 3290.970384]  #4: ffffffff82658ac8 (system_transition_mutex){+.+.}-{3:3}, at: pm_suspend+0x9f/0x310
[ 3290.970399]  #5: ffffffff827f2a48 (acpi_scan_lock){+.+.}-{3:3}, at: acpi_suspend_begin+0x4c/0x80
[ 3290.970416]  #6: ffffffff8280fca8 (dpm_list_mtx){+.+.}-{3:3}, at: dpm_resume+0x12e/0x3e0
[ 3290.970428]
               stack backtrace:
[ 3290.970431] CPU: 3 PID: 4553 Comm: systemd-sleep Tainted: G S                5.15.0+ #2420
[ 3290.970438] Hardware name: Dell Inc. XPS 13 9380/0RYJWW, BIOS 1.5.0 06/03/2019
[ 3290.970441] Call Trace:
[ 3290.970446]  dump_stack_lvl+0x44/0x57
[ 3290.970454]  check_noncircular+0x105/0x120
[ 3290.970468]  ? __lock_acquire+0x15cb/0x1b50
[ 3290.970474]  __lock_acquire+0x15cb/0x1b50
[ 3290.970487]  lock_acquire+0x26c/0x300
[ 3290.970493]  ? flush_workqueue+0x87/0x4a0
[ 3290.970503]  ? __raw_spin_lock_init+0x3b/0x60
[ 3290.970510]  ? lockdep_init_map_type+0x58/0x240
[ 3290.970519]  flush_workqueue+0xae/0x4a0
[ 3290.970526]  ? flush_workqueue+0x87/0x4a0
[ 3290.970544]  ? drain_workqueue+0xa1/0x130
[ 3290.970552]  drain_workqueue+0xa1/0x130
[ 3290.970561]  destroy_workqueue+0x34/0x1f0
[ 3290.970572]  hci_release_dev+0x49/0x180 [bluetooth]
[ 3290.970624]  bt_host_release+0x1d/0x30 [bluetooth]
[ 3290.970687]  device_release+0x33/0x90
[ 3290.970695]  kobject_release+0x63/0x160
[ 3290.970705]  dpm_resume+0x164/0x3e0
[ 3290.970710]  ? dpm_resume_early+0x251/0x3b0
[ 3290.970718]  dpm_resume_end+0xd/0x20
[ 3290.970723]  suspend_devices_and_enter+0x1a4/0xba0
[ 3290.970737]  pm_suspend+0x26b/0x310
[ 3290.970746]  state_store+0x42/0x90
[ 3290.970755]  kernfs_fop_write_iter+0x135/0x1b0
[ 3290.970764]  new_sync_write+0x125/0x1c0
[ 3290.970777]  vfs_write+0x360/0x3c0
[ 3290.970785]  ksys_write+0xa7/0xe0
[ 3290.970794]  do_syscall_64+0x3a/0x80
[ 3290.970803]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 3290.970811] RIP: 0033:0x7f41b1328164
[ 3290.970819] Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 80 00 00 00 00 8b 05 4a d2 2c 00 48 63 ff 85 c0 75 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 f3 c3 66 90 55 53 48 89 d5 48 89 f3 48 83
[ 3290.970824] RSP: 002b:00007ffe6ae21b28 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 3290.970831] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f41b1328164
[ 3290.970836] RDX: 0000000000000004 RSI: 000055965e651070 RDI: 0000000000000004
[ 3290.970839] RBP: 000055965e651070 R08: 000055965e64f390 R09: 00007f41b1e3d1c0
[ 3290.970843] R10: 000000000000000a R11: 0000000000000246 R12: 0000000000000004
[ 3290.970846] R13: 0000000000000001 R14: 000055965e64f2b0 R15: 0000000000000004

Cc: All applicable <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
digetx pushed a commit that referenced this pull request Nov 9, 2021
Patch series "Solve silent data loss caused by poisoned page cache (shmem/tmpfs)", v5.

When discussing the patch that splits page cache THP in order to offline
the poisoned page, Naoya mentioned there is a bigger problem [1] that
prevents this from working since the page cache page will be truncated
if uncorrectable errors happen.  Looking into this more deeply, it
turns out this approach (truncating the poisoned page) may incur silent
data loss for all non-readonly filesystems if the page is dirty.  It
may be worse for in-memory filesystems, e.g.  shmem/tmpfs, since the
data blocks are actually gone.

To solve this problem we could keep the poisoned dirty page in the page
cache and then notify the users on any later access, e.g.  page fault,
read/write, etc.  Clean pages could be truncated as is since they can
be reread from disk later on.

The consequence is that filesystems may find a poisoned page and
manipulate it as a healthy page, since the filesystems don't actually
check whether the page is poisoned in any of the relevant paths except
page fault.  In general, we need to make the filesystems aware of
poisoned pages before we can keep the poisoned page in the page cache
in order to solve the data loss problem.

To make filesystems aware of poisoned pages, we should consider:

 - The page should not be written back: clearing the dirty flag
   prevents writeback.

 - The page should not be dropped (it shows as a clean page) by drop
   caches or other callers: the refcount pin from hwpoison prevents it
   from being invalidated (by cache drop, inode cache shrinking, etc),
   but it doesn't prevent invalidation in the DIO path.

 - The page should be able to get truncated/hole punched/unlinked: it
   works as it is.

 - Notify users when the page is accessed, e.g. read/write, page fault
   and other paths (compression, encryption, etc).

The scope of the last one is huge since almost all filesystems need to
do it once a page is returned from a page cache lookup.  There are a
couple
of options to do it:

 1. Check hwpoison flag for every path, the most straightforward way.

 2. Return NULL for a poisoned page from page cache lookup; most
    callsites check whether NULL is returned, so this should require
    the least work, I think. But the error handling in filesystems
    would just return -ENOMEM, and that error code would obviously
    confuse users.

 3. To improve #2, we could return error pointer, e.g. ERR_PTR(-EIO),
    but this will involve significant amount of code change as well
    since all the paths need check if the pointer is ERR or not just
    like option #1.

I did prototypes for both #1 and #3, but it seems #3 may require more
changes than #1.  For #3, ERR_PTR will be returned, so all the callers
need to check the return value, otherwise an invalid pointer may be
dereferenced; but not all callers really care about the content of the
page, for example, partial truncate, which just sets the truncated range
in one page to 0.  So such paths need additional modification if
ERR_PTR is returned.  And if the callers have their own way to handle
the problematic pages, we need to add a new FGP flag to tell FGP
functions to return the pointer to the page.

It may happen very rarely, but once it happens the consequence (data
corruption) could be very bad and very hard to debug.  It seems this
problem was briefly discussed before, but no action was taken at that
time [2].

As the investigation above shows, it takes a huge amount of work to
solve the potential data loss for all filesystems.  But it is much
easier for in-memory filesystems, and such filesystems actually suffer
more than others since even the data blocks are gone due to truncating.
So this patchset starts with shmem/tmpfs by taking option #1.
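
As a rough illustration of option #1 (a minimal sketch of the pattern,
not code lifted from the series), a lookup path would bail out when it
sees the flag:

    struct page *page = find_lock_page(mapping, index);

    if (page && PageHWPoison(page)) {
            /* Poisoned dirty page kept in the page cache: report EIO
             * instead of touching the stale data. */
            unlock_page(page);
            put_page(page);
            return -EIO;
    }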

TODO:
* The unpoison has been broken since commit 0ed950d ("mm,hwpoison: make
  get_hwpoison_page() call get_any_page()"), and this patch series makes
  the refcount check for unpoisoning shmem pages fail.
* Expand to other filesystems.  But I haven't heard feedback from filesystem
  developers yet.

Patch breakdown:
Patch #1: cleanup, depended by patch #2
Patch #2: fix THP with hwpoisoned subpage(s) PMD map bug
Patch #3: coding style cleanup
Patch #4: refactor and preparation.
Patch #5: keep the poisoned page in page cache and handle such case for all
          the paths.
Patch #6: the previous patches unblock page cache THP split, so this patch
          add page cache THP split support.

This patch (of 4):

A minor cleanup to the indent.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yang Shi <[email protected]>
Reviewed-by: Naoya Horiguchi <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Peter Xu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
digetx pushed a commit that referenced this pull request Nov 9, 2021
After removing /dev/kmem, sanitizing /proc/kcore and handling /dev/mem,
this series tackles the last sane way a VM could accidentally access
logically unplugged memory managed by a virtio-mem device: /proc/vmcore

When dumping memory via "makedumpfile", PG_offline pages, used by
virtio-mem to flag logically unplugged memory, are already properly
excluded; however, especially when accessing/copying /proc/vmcore "the
usual way", we can still end up reading logically unplugged memory part of
a virtio-mem device.

Patches #1-#3 are cleanups.  Patch #4 extends the existing
oldmem_pfn_is_ram mechanism.  Patches #5-#7 are virtio-mem refactorings
for patch #8, which
implements the virtio-mem logic to query the state of device blocks.
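
For orientation, a hedged sketch of the kind of hook patch #4 builds on
(the callback name here is hypothetical; only the registration helper
predates the series):

    /* Return 1 if the old kernel's pfn is backed by readable RAM,
     * 0 if /proc/vmcore must skip it (e.g. logically unplugged). */
    static int virtio_mem_vmcore_pfn_is_ram(unsigned long pfn)
    {
            /* ... query the device block state covering this pfn ... */
            return 1;
    }

    register_oldmem_pfn_is_ram(virtio_mem_vmcore_pfn_is_ram);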

Patch #8:

"
Although virtio-mem currently supports reading unplugged memory in the
hypervisor, this will change in the future, indicated to the device via
a new feature flag. We similarly sanitized /proc/kcore access recently.
[...]
Distributions that support virtio-mem+kdump have to make sure that the
virtio_mem module will be part of the kdump kernel or the kdump initrd;
dracut was recently [2] extended to include virtio-mem in the generated
initrd. As long as no special kdump kernels are used, this will
automatically make sure that virtio-mem will be around in the kdump initrd
and sanitize /proc/vmcore access -- with dracut.
"

This is the last remaining bit to support
VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE [3] in the Linux implementation of
virtio-mem.

Note: this is best-effort.  We'll never be able to control what runs
inside the second kernel, really, but we also don't have to care: we only
care about sane setups where we don't want our VM getting zapped once we
touch the wrong memory location while dumping.  While we usually expect
sane setups to use "makedumpfile", nothing really speaks against just
copying /proc/vmcore, especially in environments where HW poisoning isn't
typically expected.  Also, we really don't want to put all our trust
completely in the memmap, so sanitizing also makes sense when just using
"makedumpfile".

[1] https://lkml.kernel.org/r/[email protected]
[2] dracutdevs/dracut#1157
[3] https://lists.oasis-open.org/archives/virtio-comment/202109/msg00021.html

This patch (of 9):

The callback is only used for the vmcore nowadays.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Boris Ostrovsky <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: Stefano Stabellini <[email protected]>
Cc: "Michael S. Tsirkin" <[email protected]>
Cc: Jason Wang <[email protected]>
Cc: Dave Young <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Vivek Goyal <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Stephen Rothwell <[email protected]>
digetx pushed a commit that referenced this pull request Nov 10, 2021
After removing /dev/kmem, sanitizing /proc/kcore and handling /dev/mem,
this series tackles the last sane way a VM could accidentally access
logically unplugged memory managed by a virtio-mem device:
/proc/vmcore

When dumping memory via "makedumpfile", PG_offline pages, used by
virtio-mem to flag logically unplugged memory, are already properly
excluded; however, especially when accessing/copying /proc/vmcore "the
usual way", we can still end up reading logically unplugged memory part
of a virtio-mem device.

Patches #1-#3 are cleanups.  Patch #4 extends the existing
oldmem_pfn_is_ram mechanism.  Patches #5-#7 are virtio-mem refactorings
for patch #8, which implements the virtio-mem logic to query the state
of device blocks.

Patch #8:
 "Although virtio-mem currently supports reading unplugged memory in the
  hypervisor, this will change in the future, indicated to the device
  via a new feature flag. We similarly sanitized /proc/kcore access
  recently.
  [...]
  Distributions that support virtio-mem+kdump have to make sure that the
  virtio_mem module will be part of the kdump kernel or the kdump
  initrd; dracut was recently [2] extended to include virtio-mem in the
  generated initrd. As long as no special kdump kernels are used, this
  will automatically make sure that virtio-mem will be around in the
  kdump initrd and sanitize /proc/vmcore access -- with dracut"

This is the last remaining bit to support
VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE [3] in the Linux implementation of
virtio-mem.

Note: this is best-effort.  We'll never be able to control what runs
inside the second kernel, really, but we also don't have to care: we
only care about sane setups where we don't want our VM getting zapped
once we touch the wrong memory location while dumping.  While we usually
expect sane setups to use "makedumpfile", nothing really speaks against
just copying /proc/vmcore, especially in environments where HW poisoning
isn't typically expected.  Also, we really don't want to put all our
trust completely in the memmap, so sanitizing also makes sense when just
using "makedumpfile".

[1] https://lkml.kernel.org/r/[email protected]
[2] dracutdevs/dracut#1157
[3] https://lists.oasis-open.org/archives/virtio-comment/202109/msg00021.html

This patch (of 9):

The callback is only used for the vmcore nowadays.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Boris Ostrovsky <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: Stefano Stabellini <[email protected]>
Cc: "Michael S. Tsirkin" <[email protected]>
Cc: Jason Wang <[email protected]>
Cc: Dave Young <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Vivek Goyal <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
digetx pushed a commit that referenced this pull request Nov 16, 2021
Convert the netfs helper library to use folios throughout, convert the 9p
and afs filesystems to use folios in their file I/O paths and convert the
ceph filesystem to use just enough folios to compile.

With these changes, afs passes -g quick xfstests.

Changes
=======
ver #5:
 - Got rid of folio_end{io,_read,_write}() and inlined the stuff it does
   instead (Willy decided he didn't want this after all).

ver #4:
 - Fixed a bug in afs_redirty_page() whereby it didn't set the next page
   index in the loop and returned too early.
 - Simplified a check in v9fs_vfs_write_folio_locked()[1].
 - Undid a change to afs_symlink_readpage()[1].
 - Used offset_in_folio() in afs_write_end()[1].
 - Changed from using page_endio() to folio_end{io,_read,_write}()[1].

ver #2:
 - Add 9p foliation.

Signed-off-by: David Howells <[email protected]>
Reviewed-by: Jeff Layton <[email protected]>
Tested-by: Jeff Layton <[email protected]>
Tested-by: Dominique Martinet <[email protected]>
Tested-by: [email protected]
cc: Matthew Wilcox (Oracle) <[email protected]>
cc: Marc Dionne <[email protected]>
cc: Ilya Dryomov <[email protected]>
cc: Dominique Martinet <[email protected]>
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
Link: https://lore.kernel.org/r/YYKa3bfQZxK5/[email protected]/ [1]
Link: https://lore.kernel.org/r/[email protected]/ # rfc
Link: https://lore.kernel.org/r/162877311459.3085614.10601478228012245108.stgit@warthog.procyon.org.uk/
Link: https://lore.kernel.org/r/162981153551.1901565.3124454657133703341.stgit@warthog.procyon.org.uk/
Link: https://lore.kernel.org/r/163005745264.2472992.9852048135392188995.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163584187452.4023316.500389675405550116.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/163649328026.309189.1124218109373941936.stgit@warthog.procyon.org.uk/ # v4
Link: https://lore.kernel.org/r/163657852454.834781.9265101983152100556.stgit@warthog.procyon.org.uk/ # v5
digetx pushed a commit that referenced this pull request Nov 16, 2021
The exit function fixes a memory leak with the src field, as detected by
LeakSanitizer. An example is:

Indirect leak of 25133184 byte(s) in 207 object(s) allocated from:
    #0 0x7f199ecfe987 in __interceptor_calloc libsanitizer/asan/asan_malloc_linux.cpp:154
    #1 0x55defe638224 in annotated_source__alloc_histograms util/annotate.c:803
    #2 0x55defe6397e4 in symbol__hists util/annotate.c:952
    #3 0x55defe639908 in symbol__inc_addr_samples util/annotate.c:968
    #4 0x55defe63aa29 in hist_entry__inc_addr_samples util/annotate.c:1119
    #5 0x55defe499a79 in hist_iter__report_callback tools/perf/builtin-report.c:182
    #6 0x55defe7a859d in hist_entry_iter__add util/hist.c:1236
    #7 0x55defe49aa63 in process_sample_event tools/perf/builtin-report.c:315
    #8 0x55defe731bc8 in evlist__deliver_sample util/session.c:1473
    #9 0x55defe731e38 in machines__deliver_event util/session.c:1510
    #10 0x55defe732a23 in perf_session__deliver_event util/session.c:1590
    #11 0x55defe72951e in ordered_events__deliver_event util/session.c:183
    #12 0x55defe740082 in do_flush util/ordered-events.c:244
    #13 0x55defe7407cb in __ordered_events__flush util/ordered-events.c:323
    #14 0x55defe740a61 in ordered_events__flush util/ordered-events.c:341
    #15 0x55defe73837f in __perf_session__process_events util/session.c:2390
    #16 0x55defe7385ff in perf_session__process_events util/session.c:2420
    ...
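
The shape of the fix is roughly this (a hedged sketch; the helper and
field names are illustrative, not taken from the patch):

    static void annotated_source__exit(struct annotated_source *src)
    {
            /* Free what annotated_source__alloc_histograms() allocated
             * so LeakSanitizer no longer flags it. */
            zfree(&src->histograms);
    }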

Signed-off-by: Ian Rogers <[email protected]>
Acked-by: Namhyung Kim <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Clark <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kajol Jain <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Martin Liška <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
digetx pushed a commit that referenced this pull request Nov 18, 2021
When enabling quotas, we attempt to commit a transaction while holding the
mutex fs_info->qgroup_ioctl_lock. This can result in a deadlock with other
quota operations such as:

- qgroup creation and deletion, ioctl BTRFS_IOC_QGROUP_CREATE;

- adding and removing qgroup relations, ioctl BTRFS_IOC_QGROUP_ASSIGN.

This is because these operations join a transaction and after that they
attempt to lock the mutex fs_info->qgroup_ioctl_lock. Acquiring that mutex
after joining or starting a transaction is a pattern followed everywhere
in qgroups, so the quota enablement operation is the one at fault here,
and should not commit a transaction while holding that mutex.

Fix this by making the transaction commit while not holding the mutex.
We are safe from two concurrent tasks trying to enable quotas because
we are serialized by the rw semaphore fs_info->subvol_sem at
btrfs_ioctl_quota_ctl(), which is the only call site for enabling
quotas.
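
The fix follows the usual pattern (a minimal sketch of the idea, not
the exact diff):

    /* Enablement is serialized by fs_info->subvol_sem, so nothing can
     * race quota state changes while the mutex is dropped here. */
    mutex_unlock(&fs_info->qgroup_ioctl_lock);
    ret = btrfs_commit_transaction(trans);
    mutex_lock(&fs_info->qgroup_ioctl_lock);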

When this deadlock happens, it produces a trace like the following:

  INFO: task syz-executor:25604 blocked for more than 143 seconds.
  Not tainted 5.15.0-rc6 #4
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  task:syz-executor state:D stack:24800 pid:25604 ppid: 24873 flags:0x00004004
  Call Trace:
  context_switch kernel/sched/core.c:4940 [inline]
  __schedule+0xcd9/0x2530 kernel/sched/core.c:6287
  schedule+0xd3/0x270 kernel/sched/core.c:6366
  btrfs_commit_transaction+0x994/0x2e90 fs/btrfs/transaction.c:2201
  btrfs_quota_enable+0x95c/0x1790 fs/btrfs/qgroup.c:1120
  btrfs_ioctl_quota_ctl fs/btrfs/ioctl.c:4229 [inline]
  btrfs_ioctl+0x637e/0x7b70 fs/btrfs/ioctl.c:5010
  vfs_ioctl fs/ioctl.c:51 [inline]
  __do_sys_ioctl fs/ioctl.c:874 [inline]
  __se_sys_ioctl fs/ioctl.c:860 [inline]
  __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
  do_syscall_x64 arch/x86/entry/common.c:50 [inline]
  do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
  entry_SYSCALL_64_after_hwframe+0x44/0xae
  RIP: 0033:0x7f86920b2c4d
  RSP: 002b:00007f868f61ac58 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  RAX: ffffffffffffffda RBX: 00007f86921d90a0 RCX: 00007f86920b2c4d
  RDX: 0000000020005e40 RSI: 00000000c0109428 RDI: 0000000000000008
  RBP: 00007f869212bd80 R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000246 R12: 00007f86921d90a0
  R13: 00007fff6d233e4f R14: 00007fff6d233ff0 R15: 00007f868f61adc0
  INFO: task syz-executor:25628 blocked for more than 143 seconds.
  Not tainted 5.15.0-rc6 #4
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  task:syz-executor state:D stack:29080 pid:25628 ppid: 24873 flags:0x00004004
  Call Trace:
  context_switch kernel/sched/core.c:4940 [inline]
  __schedule+0xcd9/0x2530 kernel/sched/core.c:6287
  schedule+0xd3/0x270 kernel/sched/core.c:6366
  schedule_preempt_disabled+0xf/0x20 kernel/sched/core.c:6425
  __mutex_lock_common kernel/locking/mutex.c:669 [inline]
  __mutex_lock+0xc96/0x1680 kernel/locking/mutex.c:729
  btrfs_remove_qgroup+0xb7/0x7d0 fs/btrfs/qgroup.c:1548
  btrfs_ioctl_qgroup_create fs/btrfs/ioctl.c:4333 [inline]
  btrfs_ioctl+0x683c/0x7b70 fs/btrfs/ioctl.c:5014
  vfs_ioctl fs/ioctl.c:51 [inline]
  __do_sys_ioctl fs/ioctl.c:874 [inline]
  __se_sys_ioctl fs/ioctl.c:860 [inline]
  __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
  do_syscall_x64 arch/x86/entry/common.c:50 [inline]
  do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
  entry_SYSCALL_64_after_hwframe+0x44/0xae

Reported-by: Hao Sun <[email protected]>
Link: https://lore.kernel.org/linux-btrfs/CACkBjsZQF19bQ1C6=yetF3BvL10OSORpFUcWXTP6HErshDB4dQ@mail.gmail.com/
Fixes: 340f1aa ("btrfs: qgroups: Move transaction management inside btrfs_quota_enable/disable")
CC: [email protected] # 4.19
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>
digetx pushed a commit that referenced this pull request Nov 18, 2021
A race condition is triggered when usermode control is given to
userspace before the kernel's MSFT query responds, resulting in an
unexpected response to userspace's reset command.

The issue can be observed in btmon:

< HCI Command: Vendor (0x3f|0x001e) plen 2                    #3 [hci0]
        05 01                                            ..
@ USER Open: bt_stack_manage (privileged) version 2.22  {0x0002} [hci0]
< HCI Command: Reset (0x03|0x0003) plen 0                     #4 [hci0]
> HCI Event: Command Complete (0x0e) plen 5                   #5 [hci0]
      Vendor (0x3f|0x001e) ncmd 1
	Status: Command Disallowed (0x0c)
	05                                               .
> HCI Event: Command Complete (0x0e) plen 4                   #6 [hci0]
      Reset (0x03|0x0003) ncmd 2
	Status: Success (0x00)

Reviewed-by: Abhishek Pandit-Subedi <[email protected]>
Reviewed-by: Sonny Sasaka <[email protected]>
Signed-off-by: Jesse Melhuish <[email protected]>
Signed-off-by: Marcel Holtmann <[email protected]>
digetx pushed a commit that referenced this pull request Nov 18, 2021
[BUG]
The following script can cause btrfs to crash:

 mount -o compress-force=lzo $DEV /mnt
 dd if=/dev/urandom of=/mnt/foo bs=4k count=1
 sync

The call trace looks like this:

 general protection fault, probably for non-canonical address 0xe04b37fccce3b000: 0000 [#1] PREEMPT SMP NOPTI
 CPU: 5 PID: 164 Comm: kworker/u20:3 Not tainted 5.15.0-rc7-custom+ #4
 Workqueue: btrfs-delalloc btrfs_work_helper [btrfs]
 RIP: 0010:__memcpy+0x12/0x20
 Call Trace:
  lzo_compress_pages+0x236/0x540 [btrfs]
  btrfs_compress_pages+0xaa/0xf0 [btrfs]
  compress_file_range+0x431/0x8e0 [btrfs]
  async_cow_start+0x12/0x30 [btrfs]
  btrfs_work_helper+0xf6/0x3e0 [btrfs]
  process_one_work+0x294/0x5d0
  worker_thread+0x55/0x3c0
  kthread+0x140/0x170
  ret_from_fork+0x22/0x30
 ---[ end trace 63c3c0f131e61982 ]---

[CAUSE]
In lzo_compress_pages(), parameter @out_pages is not only an output
parameter (for the number of compressed pages), but also an input
parameter, as the upper limit of compressed pages we can utilize.

In commit d408880 ("btrfs: subpage: make lzo_compress_pages()
compatible"), the refactor doesn't take @out_pages as an input, thus
completely ignoring the limit.

And for compress-force case, we could hit incompressible data that
compressed size would go beyond the page limit, and cause above crash.

[FIX]
Save @out_pages as @max_nr_page, and pass it to lzo_compress_pages(),
and check if we're beyond the limit before accessing the pages.

Reported-by: Omar Sandoval <[email protected]>
Fixes: d408880 ("btrfs: subpage: make lzo_compress_pages() compatible")
Signed-off-by: Qu Wenruo <[email protected]>
Reviewed-by: Omar Sandoval <[email protected]>
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: David Sterba <[email protected]>
digetx pushed a commit that referenced this pull request Nov 18, 2021
[BUG]
The following script can cause btrfs to crash:

  $ mount -o compress-force=lzo $DEV /mnt
  $ dd if=/dev/urandom of=/mnt/foo bs=4k count=1
  $ sync

The call trace looks like this:

  general protection fault, probably for non-canonical address 0xe04b37fccce3b000: 0000 [#1] PREEMPT SMP NOPTI
  CPU: 5 PID: 164 Comm: kworker/u20:3 Not tainted 5.15.0-rc7-custom+ #4
  Workqueue: btrfs-delalloc btrfs_work_helper [btrfs]
  RIP: 0010:__memcpy+0x12/0x20
  Call Trace:
   lzo_compress_pages+0x236/0x540 [btrfs]
   btrfs_compress_pages+0xaa/0xf0 [btrfs]
   compress_file_range+0x431/0x8e0 [btrfs]
   async_cow_start+0x12/0x30 [btrfs]
   btrfs_work_helper+0xf6/0x3e0 [btrfs]
   process_one_work+0x294/0x5d0
   worker_thread+0x55/0x3c0
   kthread+0x140/0x170
   ret_from_fork+0x22/0x30
  ---[ end trace 63c3c0f131e61982 ]---

[CAUSE]
In lzo_compress_pages(), parameter @out_pages is not only an output
parameter (for the number of compressed pages), but also an input
parameter, as the upper limit of compressed pages we can utilize.

In commit d408880 ("btrfs: subpage: make lzo_compress_pages()
compatible"), the refactoring doesn't take @out_pages as an input, thus
completely ignoring the limit.

And for the compress-force case, we could hit incompressible data whose
compressed size would go beyond the page limit, and cause the above
crash.

[FIX]
Save @out_pages as @max_nr_page, and pass it to lzo_compress_pages(),
and check if we're beyond the limit before accessing the pages.
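
In sketch form (a hedged illustration of the check, not the literal
patch):

    const u32 max_nr_page = *out_pages;  /* remember the caller's limit */

    /* ... inside the compression loop, before touching pages[cur] ... */
    if (cur >= max_nr_page)
            return -E2BIG;  /* incompressible data hit the page limit */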

Note: this also fixes a crash on 32-bit architectures that was suspected
to be caused by the merge of btrfs patches into 5.16-rc1. Reported in
https://lore.kernel.org/all/[email protected]/ .

Reported-by: Omar Sandoval <[email protected]>
Fixes: d408880 ("btrfs: subpage: make lzo_compress_pages() compatible")
Reviewed-by: Omar Sandoval <[email protected]>
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Qu Wenruo <[email protected]>
Reviewed-by: David Sterba <[email protected]>
[ add note ]
Signed-off-by: David Sterba <[email protected]>
digetx pushed a commit that referenced this pull request Jan 16, 2022
If the key is already present, then free the key that was used for the
lookup; a sketch of the pattern follows the leak trace below.

Found with:
$ perf stat -M IO_Read_BW /bin/true

==1749112==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 32 byte(s) in 4 object(s) allocated from:
    #0 0x7f6f6fa7d7cf in __interceptor_malloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:145
    #1 0x55acecd9d7a6 in check_per_pkg util/stat.c:343
    #2 0x55acecd9d9c5 in process_counter_values util/stat.c:365
    #3 0x55acecd9e0ab in process_counter_maps util/stat.c:421
    #4 0x55acecd9e292 in perf_stat_process_counter util/stat.c:443
    #5 0x55aceca8553e in read_counters ./tools/perf/builtin-stat.c:470
    #6 0x55aceca88fe3 in __run_perf_stat ./tools/perf/builtin-stat.c:1023
    #7 0x55aceca89146 in run_perf_stat ./tools/perf/builtin-stat.c:1048
    #8 0x55aceca90858 in cmd_stat ./tools/perf/builtin-stat.c:2555
    #9 0x55acecc05fa5 in run_builtin ./tools/perf/perf.c:313
    #10 0x55acecc064fe in handle_internal_command ./tools/perf/perf.c:365
    #11 0x55acecc068bb in run_argv ./tools/perf/perf.c:409
    #12 0x55acecc070aa in main ./tools/perf/perf.c:539
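
In sketch form (the hashmap usage is assumed for illustration, not
copied from the patch):

    uint64_t *key = malloc(sizeof(*key));

    *key = pkg_id;
    if (hashmap__add(ids, key, NULL) == -EEXIST)
            free(key);      /* an equal key is already stored; drop ours */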

Reviewed-by: James Clark <[email protected]>
Signed-off-by: Ian Rogers <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: John Garry <[email protected]>
Cc: Kajol Jain <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Leo Yan <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Mathieu Poirier <[email protected]>
Cc: Mike Leach <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Paul Clarke <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Riccardo Mancini <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Suzuki Poulouse <[email protected]>
Cc: Vineet Singh <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
digetx pushed a commit that referenced this pull request Jan 16, 2022
Change the cifs filesystem to take account of the changes to fscache's
indexing rewrite and reenable caching in cifs.

The following changes have been made:

 (1) The fscache_netfs struct is no more, and there's no need to register
     the filesystem as a whole.

 (2) The session cookie is now an fscache_volume cookie, allocated with
     fscache_acquire_volume().  That takes three parameters: a string
     representing the "volume" in the index, a string naming the cache to
     use (or NULL) and a u64 that conveys coherency metadata for the
     volume.

     For cifs, I've made it render the volume name string as:

	"cifs,<ipaddress>,<sharename>"

     where the sharename has '/' characters replaced with ';'.

     This probably needs rethinking a bit as the total name could exceed
     the maximum filename component length.

     Further, the coherency data is currently just set to 0.  It needs
     something else doing with it - I wonder if it would suffice simply to
     sum the resource_id, vol_create_time and vol_serial_number or maybe
     hash them.  (A sketch of this call follows the list of changes
     below.)

 (3) The fscache_cookie_def is no more and needed information is passed
     directly to fscache_acquire_cookie().  The cache no longer calls back
     into the filesystem, but rather metadata changes are indicated at
     other times.

     fscache_acquire_cookie() is passed the same keying and coherency
     information as before.

 (4) The functions to set/reset cookies are removed and
     fscache_use_cookie() and fscache_unuse_cookie() are used instead.

     fscache_use_cookie() is passed a flag to indicate if the cookie is
     opened for writing.  fscache_unuse_cookie() is passed updates for the
     metadata if we changed it (ie. if the file was opened for writing).

     These are called when the file is opened or closed.

 (5) cifs_setattr_*() are made to call fscache_resize() to change the size
     of the cache object.

 (6) The functions to read and write data are stubbed out pending a
     conversion to use netfslib.
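
Putting (2) together as code (a sketch following the three-parameter
form described above; the name-building variables are illustrative):

    char *key = kasprintf(GFP_KERNEL, "cifs,%s,%s",
                          ipaddr, sharename_sanitized);

    if (!key)
            return -ENOMEM;
    /* Coherency data is currently just 0, per the note in (2). */
    tcon->fscache = fscache_acquire_volume(key, NULL, 0);
    kfree(key);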

Changes
=======
ver #7:
 - Removed the accidentally added-back call to get the super cookie in
   cifs_root_iget().
 - Fixed the right call to cifs_fscache_get_super_cookie() to take account
   of the "-o fsc" mount flag.

ver #6:
 - Moved the change of gfpflags_allow_blocking() to current_is_kswapd() for
   cifs here.
 - Fixed one of the error paths in cifs_atomic_open() to jump around the
   call to use the cookie.
 - Fixed an additional successful return in the middle of cifs_open() to
   use the cookie on the way out.
 - Only get a volume cookie (and thus inode cookies) when "-o fsc" is
   supplied to mount.

ver #5:
 - Fixed a couple of bits of cookie handling[2]:
   - The cookie should be released in cifs_evict_inode(), not
     cifsFileInfo_put_final().  The cookie needs to persist beyond file
     closure so that writepages will be able to write to it.
   - fscache_use_cookie() needs to be called in cifs_atomic_open() as it is
     for cifs_open().

ver #4:
 - Fixed the use of sizeof with memset.
 - tcon->vol_create_time is __le64 so doesn't need cpu_to_le64().

ver #3:
 - Canonicalise the cifs coherency data to make the cache portable.
 - Set volume coherency data.

ver #2:
 - Use gfpflags_allow_blocking() rather than using flag directly.
 - Upgraded to -rc4 to allow for upstream changes[1].
 - fscache_acquire_volume() now returns errors.

Signed-off-by: David Howells <[email protected]>
Acked-by: Jeff Layton <[email protected]>
cc: Steve French <[email protected]>
cc: Shyam Prasad N <[email protected]>
cc: [email protected]
cc: [email protected]
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=23b55d673d7527b093cd97b7c217c82e70cd1af0 [1]
Link: https://lore.kernel.org/r/[email protected]/ [2]
Link: https://lore.kernel.org/r/163819671009.215744.11230627184193298714.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906982979.143852.10672081929614953210.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967187187.1823006.247415138444991444.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021579335.640689.2681324337038770579.stgit@warthog.procyon.org.uk/ # v4
Link: https://lore.kernel.org/r/[email protected]/ # v5
Link: https://lore.kernel.org/r/[email protected]/ # v6
jonasschwoebel pushed a commit to Open-Surface-RT/grate-linux that referenced this pull request Oct 21, 2022
btrfs_can_activate_zone() can be called with the device_list_mutex already
held, which will lead to a deadlock:

insert_dev_extents() // Takes device_list_mutex
`-> insert_dev_extent()
 `-> btrfs_insert_empty_item()
  `-> btrfs_insert_empty_items()
   `-> btrfs_search_slot()
    `-> btrfs_cow_block()
     `-> __btrfs_cow_block()
      `-> btrfs_alloc_tree_block()
       `-> btrfs_reserve_extent()
        `-> find_free_extent()
         `-> find_free_extent_update_loop()
          `-> can_allocate_chunk()
           `-> btrfs_can_activate_zone() // Takes device_list_mutex again

As we're only traversing the list for reads, we can switch from the
device_list_mutex to an RCU traversal of the list.
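
In sketch form, the read-side traversal becomes (a minimal
illustration, not the literal diff):

    struct btrfs_device *device;

    rcu_read_lock();
    list_for_each_entry_rcu(device, &fs_devices->devices, dev_list) {
            /* read-only inspection of each device's zone state */
    }
    rcu_read_unlock();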

  [15.166572] WARNING: possible recursive locking detected
  [15.167117] 5.17.0-rc6-dennis #79 Not tainted
  [15.167487] --------------------------------------------
  [15.167733] kworker/u8:3/146 is trying to acquire lock:
  [15.167733] ffff888102962ee0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: find_free_extent+0x15a/0x14f0 [btrfs]
  [15.167733]
  [15.167733] but task is already holding lock:
  [15.167733] ffff888102962ee0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: btrfs_create_pending_block_groups+0x20a/0x560 [btrfs]
  [15.167733]
  [15.167733] other info that might help us debug this:
  [15.167733]  Possible unsafe locking scenario:
  [15.167733]
  [15.171834]        CPU0
  [15.171834]        ----
  [15.171834]   lock(&fs_devs->device_list_mutex);
  [15.171834]   lock(&fs_devs->device_list_mutex);
  [15.171834]
  [15.171834]  *** DEADLOCK ***
  [15.171834]
  [15.171834]  May be due to missing lock nesting notation
  [15.171834]
  [15.171834] 5 locks held by kworker/u8:3/146:
  [15.171834]  #0: ffff888100050938 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x1c3/0x5a0
  [15.171834]  #1: ffffc9000067be80 ((work_completion)(&fs_info->async_data_reclaim_work)){+.+.}-{0:0}, at: process_one_work+0x1c3/0x5a0
  [15.176244]  #2: ffff88810521e620 (sb_internal){.+.+}-{0:0}, at: flush_space+0x335/0x600 [btrfs]
  [15.176244]  #3: ffff888102962ee0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: btrfs_create_pending_block_groups+0x20a/0x560 [btrfs]
  [15.176244]  #4: ffff8881152e4b78 (btrfs-dev-00){++++}-{3:3}, at: __btrfs_tree_lock+0x27/0x130 [btrfs]
  [15.179641]
  [15.179641] stack backtrace:
  [15.179641] CPU: 1 PID: 146 Comm: kworker/u8:3 Not tainted 5.17.0-rc6-dennis #79
  [15.179641] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1.fc35 04/01/2014
  [15.179641] Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs]
  [15.179641] Call Trace:
  [15.179641]  <TASK>
  [15.179641]  dump_stack_lvl+0x45/0x59
  [15.179641]  __lock_acquire.cold+0x217/0x2b2
  [15.179641]  lock_acquire+0xbf/0x2b0
  [15.183838]  ? find_free_extent+0x15a/0x14f0 [btrfs]
  [15.183838]  __mutex_lock+0x8e/0x970
  [15.183838]  ? find_free_extent+0x15a/0x14f0 [btrfs]
  [15.183838]  ? find_free_extent+0x15a/0x14f0 [btrfs]
  [15.183838]  ? lock_is_held_type+0xd7/0x130
  [15.183838]  ? find_free_extent+0x15a/0x14f0 [btrfs]
  [15.183838]  find_free_extent+0x15a/0x14f0 [btrfs]
  [15.183838]  ? _raw_spin_unlock+0x24/0x40
  [15.183838]  ? btrfs_get_alloc_profile+0x106/0x230 [btrfs]
  [15.187601]  btrfs_reserve_extent+0x131/0x260 [btrfs]
  [15.187601]  btrfs_alloc_tree_block+0xb5/0x3b0 [btrfs]
  [15.187601]  __btrfs_cow_block+0x138/0x600 [btrfs]
  [15.187601]  btrfs_cow_block+0x10f/0x230 [btrfs]
  [15.187601]  btrfs_search_slot+0x55f/0xbc0 [btrfs]
  [15.187601]  ? lock_is_held_type+0xd7/0x130
  [15.187601]  btrfs_insert_empty_items+0x2d/0x60 [btrfs]
  [15.187601]  btrfs_create_pending_block_groups+0x2b3/0x560 [btrfs]
  [15.187601]  __btrfs_end_transaction+0x36/0x2a0 [btrfs]
  [15.192037]  flush_space+0x374/0x600 [btrfs]
  [15.192037]  ? find_held_lock+0x2b/0x80
  [15.192037]  ? btrfs_async_reclaim_data_space+0x49/0x180 [btrfs]
  [15.192037]  ? lock_release+0x131/0x2b0
  [15.192037]  btrfs_async_reclaim_data_space+0x70/0x180 [btrfs]
  [15.192037]  process_one_work+0x24c/0x5a0
  [15.192037]  worker_thread+0x4a/0x3d0

Fixes: a85f05e ("btrfs: zoned: avoid chunk allocation if active block group has enough space")
CC: [email protected] # 5.16+
Signed-off-by: Johannes Thumshirn <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>