Merge pull request #1 from torvalds/master #100

mbin · 2014-06-18T01:10:21Z

pull from torvalds

enable_irq_wake() might fail, if so, we will see kernel warning in resume entries due to it always calls disable_irq_wake(). WARNING: at kernel/irq/manage.c:529 irq_set_irq_wake+0xc4/0xf0() Unbalanced IRQ 52 wake disable Modules linked in: ipv6 libcomposite configfs CPU: 0 PID: 1591 Comm: ash Tainted: G W 3.10.0-00854-gdbd86d4-dirty torvalds#100 (unwind_backtrace+0x0/0xf8) from (show_stack+0x10/0x14) (show_stack+0x10/0x14) from (warn_slowpath_common+0x54/0x68) (warn_slowpath_common+0x54/0x68) from (warn_slowpath_fmt+0x30/0x40) (warn_slowpath_fmt+0x30/0x40) from (irq_set_irq_wake+0xc4/0xf0) (irq_set_irq_wake+0xc4/0xf0) from (sirfsoc_rtc_restore+0x30/0x38) (sirfsoc_rtc_restore+0x30/0x38) from (platform_pm_restore+0x2c/0x50) (platform_pm_restore+0x2c/0x50) from (dpm_run_callback.clone.6+0x30/0xb0) (dpm_run_callback.clone.6+0x30/0xb0) from (device_resume+0x88/0x134) (device_resume+0x88/0x134) from (dpm_resume+0x114/0x230) (dpm_resume+0x114/0x230) from (hibernation_snapshot+0x178/0x1d0) (hibernation_snapshot+0x178/0x1d0) from (hibernate+0x130/0x1dc) (hibernate+0x130/0x1dc) from (state_store+0xb4/0xc0) (state_store+0xb4/0xc0) from (kobj_attr_store+0x14/0x20) (kobj_attr_store+0x14/0x20) from (sysfs_write_file+0xfc/0x17c) (sysfs_write_file+0xfc/0x17c) from (vfs_write+0xc8/0x194) (vfs_write+0xc8/0x194) from (SyS_write+0x40/0x6c) (SyS_write+0x40/0x6c) from (ret_fast_syscall+0x0/0x30) To avoid unbalanced "IRQ wake disable", ensure that disable_irq_wake() is called only when enable_irq_wake() have been successfully enabled. Signed-off-by: Xianglong Du <[email protected]> Signed-off-by: Barry Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

Turn it into (for example): [ 0.073380] x86: Booting SMP configuration: [ 0.074005] .... node #0, CPUs: #1 #2 #3 #4 #5 torvalds#6 torvalds#7 [ 0.603005] .... node #1, CPUs: torvalds#8 torvalds#9 torvalds#10 torvalds#11 torvalds#12 torvalds#13 torvalds#14 torvalds#15 [ 1.200005] .... node #2, CPUs: torvalds#16 torvalds#17 torvalds#18 torvalds#19 torvalds#20 torvalds#21 torvalds#22 torvalds#23 [ 1.796005] .... node #3, CPUs: torvalds#24 torvalds#25 torvalds#26 torvalds#27 torvalds#28 torvalds#29 torvalds#30 torvalds#31 [ 2.393005] .... node #4, CPUs: torvalds#32 torvalds#33 torvalds#34 torvalds#35 torvalds#36 torvalds#37 torvalds#38 torvalds#39 [ 2.996005] .... node #5, CPUs: torvalds#40 torvalds#41 torvalds#42 torvalds#43 torvalds#44 torvalds#45 torvalds#46 torvalds#47 [ 3.600005] .... node torvalds#6, CPUs: torvalds#48 torvalds#49 torvalds#50 torvalds#51 #52 #53 torvalds#54 torvalds#55 [ 4.202005] .... node torvalds#7, CPUs: torvalds#56 torvalds#57 #58 torvalds#59 torvalds#60 torvalds#61 torvalds#62 torvalds#63 [ 4.811005] .... node torvalds#8, CPUs: torvalds#64 torvalds#65 torvalds#66 torvalds#67 torvalds#68 torvalds#69 #70 torvalds#71 [ 5.421006] .... node torvalds#9, CPUs: torvalds#72 torvalds#73 torvalds#74 torvalds#75 torvalds#76 torvalds#77 torvalds#78 torvalds#79 [ 6.032005] .... node torvalds#10, CPUs: torvalds#80 torvalds#81 torvalds#82 torvalds#83 torvalds#84 torvalds#85 torvalds#86 torvalds#87 [ 6.648006] .... node torvalds#11, CPUs: torvalds#88 torvalds#89 torvalds#90 torvalds#91 torvalds#92 torvalds#93 torvalds#94 torvalds#95 [ 7.262005] .... node torvalds#12, CPUs: torvalds#96 torvalds#97 torvalds#98 torvalds#99 torvalds#100 torvalds#101 torvalds#102 torvalds#103 [ 7.865005] .... node torvalds#13, CPUs: torvalds#104 torvalds#105 torvalds#106 torvalds#107 torvalds#108 torvalds#109 torvalds#110 torvalds#111 [ 8.466005] .... node torvalds#14, CPUs: torvalds#112 torvalds#113 torvalds#114 torvalds#115 torvalds#116 torvalds#117 torvalds#118 torvalds#119 [ 9.073006] .... node torvalds#15, CPUs: torvalds#120 torvalds#121 torvalds#122 torvalds#123 torvalds#124 torvalds#125 torvalds#126 torvalds#127 [ 9.679901] x86: Booted up 16 nodes, 128 CPUs and drop useless elements. Change num_digits() to hpa's division-avoiding, cell-phone-typed version which he went at great lengths and pains to submit on a Saturday evening. Signed-off-by: Borislav Petkov <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: Linus Torvalds <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>

…-registers-v3 Working syspll get from registers v3

commit 66efdc7 upstream. snd_seq_timer_open() didn't catch the whole error path but let through if the timer id is a slave. This may lead to Oops by accessing the uninitialized pointer. BUG: unable to handle kernel NULL pointer dereference at 00000000000002ae IP: [<ffffffff819b3477>] snd_seq_timer_open+0xe7/0x130 PGD 785cd067 PUD 76964067 PMD 0 Oops: 0002 [#4] SMP CPU 0 Pid: 4288, comm: trinity-child7 Tainted: G D W 3.9.0-rc1+ torvalds#100 Bochs Bochs RIP: 0010:[<ffffffff819b3477>] [<ffffffff819b3477>] snd_seq_timer_open+0xe7/0x130 RSP: 0018:ffff88006ece7d38 EFLAGS: 00010246 RAX: 0000000000000286 RBX: ffff88007851b400 RCX: 0000000000000000 RDX: 000000000000ffff RSI: ffff88006ece7d58 RDI: ffff88006ece7d38 RBP: ffff88006ece7d98 R08: 000000000000000a R09: 000000000000fffe R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: ffff8800792c5400 R14: 0000000000e8f000 R15: 0000000000000007 FS: 00007f7aaa650700(0000) GS:ffff88007f800000(0000) GS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000002ae CR3: 000000006efec000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process trinity-child7 (pid: 4288, threadinfo ffff88006ece6000, task ffff880076a8a290) Stack: 0000000000000286 ffffffff828f2be0 ffff88006ece7d58 ffffffff810f354d 65636e6575716573 2065756575712072 ffff8800792c0030 0000000000000000 ffff88006ece7d98 ffff8800792c5400 ffff88007851b400 ffff8800792c5520 Call Trace: [<ffffffff810f354d>] ? trace_hardirqs_on+0xd/0x10 [<ffffffff819b17e9>] snd_seq_queue_timer_open+0x29/0x70 [<ffffffff819ae01a>] snd_seq_ioctl_set_queue_timer+0xda/0x120 [<ffffffff819acb9b>] snd_seq_do_ioctl+0x9b/0xd0 [<ffffffff819acbe0>] snd_seq_ioctl+0x10/0x20 [<ffffffff811b9542>] do_vfs_ioctl+0x522/0x570 [<ffffffff8130a4b3>] ? file_has_perm+0x83/0xa0 [<ffffffff810f354d>] ? trace_hardirqs_on+0xd/0x10 [<ffffffff811b95ed>] sys_ioctl+0x5d/0xa0 [<ffffffff813663fe>] ? trace_hardirqs_on_thunk+0x3a/0x3f [<ffffffff81faed69>] system_call_fastpath+0x16/0x1b Reported-and-tested-by: Tommi Rantala <[email protected]> Signed-off-by: Takashi Iwai <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

commit 66efdc7 upstream. snd_seq_timer_open() didn't catch the whole error path but let through if the timer id is a slave. This may lead to Oops by accessing the uninitialized pointer. BUG: unable to handle kernel NULL pointer dereference at 00000000000002ae IP: [<ffffffff819b3477>] snd_seq_timer_open+0xe7/0x130 PGD 785cd067 PUD 76964067 PMD 0 Oops: 0002 [#4] SMP CPU 0 Pid: 4288, comm: trinity-child7 Tainted: G D W 3.9.0-rc1+ torvalds#100 Bochs Bochs RIP: 0010:[<ffffffff819b3477>] [<ffffffff819b3477>] snd_seq_timer_open+0xe7/0x130 RSP: 0018:ffff88006ece7d38 EFLAGS: 00010246 RAX: 0000000000000286 RBX: ffff88007851b400 RCX: 0000000000000000 RDX: 000000000000ffff RSI: ffff88006ece7d58 RDI: ffff88006ece7d38 RBP: ffff88006ece7d98 R08: 000000000000000a R09: 000000000000fffe R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: ffff8800792c5400 R14: 0000000000e8f000 R15: 0000000000000007 FS: 00007f7aaa650700(0000) GS:ffff88007f800000(0000) GS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000002ae CR3: 000000006efec000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process trinity-child7 (pid: 4288, threadinfo ffff88006ece6000, task ffff880076a8a290) Stack: 0000000000000286 ffffffff828f2be0 ffff88006ece7d58 ffffffff810f354d 65636e6575716573 2065756575712072 ffff8800792c0030 0000000000000000 ffff88006ece7d98 ffff8800792c5400 ffff88007851b400 ffff8800792c5520 Call Trace: [<ffffffff810f354d>] ? trace_hardirqs_on+0xd/0x10 [<ffffffff819b17e9>] snd_seq_queue_timer_open+0x29/0x70 [<ffffffff819ae01a>] snd_seq_ioctl_set_queue_timer+0xda/0x120 [<ffffffff819acb9b>] snd_seq_do_ioctl+0x9b/0xd0 [<ffffffff819acbe0>] snd_seq_ioctl+0x10/0x20 [<ffffffff811b9542>] do_vfs_ioctl+0x522/0x570 [<ffffffff8130a4b3>] ? file_has_perm+0x83/0xa0 [<ffffffff810f354d>] ? trace_hardirqs_on+0xd/0x10 [<ffffffff811b95ed>] sys_ioctl+0x5d/0xa0 [<ffffffff813663fe>] ? trace_hardirqs_on_thunk+0x3a/0x3f [<ffffffff81faed69>] system_call_fastpath+0x16/0x1b Reported-and-tested-by: Tommi Rantala <[email protected]> Signed-off-by: Takashi Iwai <[email protected]> Signed-off-by: Ben Hutchings <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

commit 66efdc7 upstream. snd_seq_timer_open() didn't catch the whole error path but let through if the timer id is a slave. This may lead to Oops by accessing the uninitialized pointer. BUG: unable to handle kernel NULL pointer dereference at 00000000000002ae IP: [<ffffffff819b3477>] snd_seq_timer_open+0xe7/0x130 PGD 785cd067 PUD 76964067 PMD 0 Oops: 0002 [#4] SMP CPU 0 Pid: 4288, comm: trinity-child7 Tainted: G D W 3.9.0-rc1+ torvalds#100 Bochs Bochs RIP: 0010:[<ffffffff819b3477>] [<ffffffff819b3477>] snd_seq_timer_open+0xe7/0x130 RSP: 0018:ffff88006ece7d38 EFLAGS: 00010246 RAX: 0000000000000286 RBX: ffff88007851b400 RCX: 0000000000000000 RDX: 000000000000ffff RSI: ffff88006ece7d58 RDI: ffff88006ece7d38 RBP: ffff88006ece7d98 R08: 000000000000000a R09: 000000000000fffe R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: ffff8800792c5400 R14: 0000000000e8f000 R15: 0000000000000007 FS: 00007f7aaa650700(0000) GS:ffff88007f800000(0000) GS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000002ae CR3: 000000006efec000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process trinity-child7 (pid: 4288, threadinfo ffff88006ece6000, task ffff880076a8a290) Stack: 0000000000000286 ffffffff828f2be0 ffff88006ece7d58 ffffffff810f354d 65636e6575716573 2065756575712072 ffff8800792c0030 0000000000000000 ffff88006ece7d98 ffff8800792c5400 ffff88007851b400 ffff8800792c5520 Call Trace: [<ffffffff810f354d>] ? trace_hardirqs_on+0xd/0x10 [<ffffffff819b17e9>] snd_seq_queue_timer_open+0x29/0x70 [<ffffffff819ae01a>] snd_seq_ioctl_set_queue_timer+0xda/0x120 [<ffffffff819acb9b>] snd_seq_do_ioctl+0x9b/0xd0 [<ffffffff819acbe0>] snd_seq_ioctl+0x10/0x20 [<ffffffff811b9542>] do_vfs_ioctl+0x522/0x570 [<ffffffff8130a4b3>] ? file_has_perm+0x83/0xa0 [<ffffffff810f354d>] ? trace_hardirqs_on+0xd/0x10 [<ffffffff811b95ed>] sys_ioctl+0x5d/0xa0 [<ffffffff813663fe>] ? trace_hardirqs_on_thunk+0x3a/0x3f [<ffffffff81faed69>] system_call_fastpath+0x16/0x1b Reported-and-tested-by: Tommi Rantala <[email protected]> Signed-off-by: Takashi Iwai <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

…eckpatch-fixes WARNING: do not add new typedefs torvalds#86: FILE: arch/powerpc/include/asm/elf_util.h:35: +typedef unsigned long func_desc_t; WARNING: do not add new typedefs torvalds#90: FILE: arch/powerpc/include/asm/elf_util.h:39: +typedef struct ppc64_opd_entry func_desc_t; WARNING: Block comments use * on subsequent lines torvalds#94: FILE: arch/powerpc/include/asm/elf_util.h:43: +/* Like PPC32, we need little trampolines to do > 24-bit jumps (into + the kernel itself). But on PPC64, these need to be used for every WARNING: Block comments use a trailing */ on a separate line torvalds#95: FILE: arch/powerpc/include/asm/elf_util.h:44: + jump, actually, to reset r2 (TOC+0x8000). */ ERROR: open brace '{' following struct go on the same line torvalds#97: FILE: arch/powerpc/include/asm/elf_util.h:46: +struct ppc64_stub_entry +{ WARNING: Block comments use a trailing */ on a separate line torvalds#100: FILE: arch/powerpc/include/asm/elf_util.h:49: + * so we don't have to modify the trampoline load instruction. */ WARNING: Block comments use * on subsequent lines torvalds#110: FILE: arch/powerpc/include/asm/elf_util.h:59: +/* r2 is the TOC pointer: it actually points 0x8000 into the TOC (this + gives the value maximum span in an instruction which uses a signed WARNING: Block comments use a trailing */ on a separate line torvalds#111: FILE: arch/powerpc/include/asm/elf_util.h:60: + offset) */ WARNING: Block comments use * on subsequent lines torvalds#132: FILE: arch/powerpc/include/asm/module.h:18: +/* Both low and high 16 bits are added as SIGNED additions, so if low + 16 bits has high bit set, high 16 bits must be adjusted. These WARNING: Block comments use a trailing */ on a separate line torvalds#133: FILE: arch/powerpc/include/asm/module.h:19: + macros do that (stolen from binutils). */ WARNING: space prohibited between function name and open parenthesis '(' torvalds#136: FILE: arch/powerpc/include/asm/module.h:22: +#define PPC_HA(v) PPC_HI ((v) + 0x8000) ERROR: Macros with complex values should be enclosed in parentheses torvalds#136: FILE: arch/powerpc/include/asm/module.h:22: +#define PPC_HA(v) PPC_HI ((v) + 0x8000) WARNING: please, no spaces at the start of a line torvalds#210: FILE: arch/powerpc/kernel/elf_util_64.c:32: + (((1 << (((other) & STO_PPC64_LOCAL_MASK) >> STO_PPC64_LOCAL_BIT)) >> 2) << 2)$ WARNING: Block comments use a trailing */ on a separate line torvalds#216: FILE: arch/powerpc/kernel/elf_util_64.c:38: + * of function and try to derive r2 from it). */ WARNING: line over 80 characters torvalds#357: FILE: arch/powerpc/kernel/elf_util_64.c:179: + value = stub_for_addr(elf_info, value, obj_name); WARNING: line over 80 characters torvalds#363: FILE: arch/powerpc/kernel/elf_util_64.c:185: + squash_toc_save_inst(strtab + sym->st_name, value); ERROR: space required before the open brace '{' torvalds#369: FILE: arch/powerpc/kernel/elf_util_64.c:191: + if (value + 0x2000000 > 0x3ffffff || (value & 3) != 0){ WARNING: line over 80 characters torvalds#560: FILE: arch/powerpc/kernel/module_64.c:341: + sechdrs[me->arch.elf_info.stubs_section].sh_size = get_stubs_size(hdr, sechdrs); WARNING: line over 80 characters torvalds#613: FILE: arch/powerpc/kernel/module_64.c:380: + struct elf_shdr *stubs_sec = &elf_info->sechdrs[elf_info->stubs_section]; WARNING: line over 80 characters torvalds#889: FILE: arch/powerpc/kernel/module_64.c:498: + num_stubs = sechdrs[me->arch.elf_info.stubs_section].sh_size / sizeof(*entry); total: 3 errors, 17 warnings, 830 lines checked NOTE: For some of the reported defects, checkpatch may be able to mechanically convert to the typical style using --fix or --fix-inplace. ./patches/powerpc-factor-out-relocation-code-from-module_64c-to-elf_util_64c.patch has style problems, please review. NOTE: If any of the errors are false positives, please report them to the maintainer, see CHECKPATCH in MAINTAINERS. Please run checkpatch prior to sending patches Cc: Thiago Jung Bauermann <[email protected]> Signed-off-by: Andrew Morton <[email protected]>

On a ppc64 machine executing overlayfs/019 with xfs as the lower and upper filesystem causes the following call trace, WARNING: CPU: 2 PID: 8034 at /root/repos/linux/fs/iomap.c:765 .iomap_dio_actor+0xcc/0x420 Modules linked in: CPU: 2 PID: 8034 Comm: fsstress Tainted: G L 4.11.0-rc5-next-20170405 torvalds#100 task: c000000631314880 task.stack: c0000003915d4000 NIP: c00000000035a72c LR: c00000000035a6f4 CTR: c00000000035a660 REGS: c0000003915d7570 TRAP: 0700 Tainted: G L (4.11.0-rc5-next-20170405) MSR: 800000000282b032 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI> CR: 24004284 XER: 00000000 CFAR: c0000000006f7190 SOFTE: 1 GPR00: c00000000035a6f4 c0000003915d77f0 c0000000015a3f00 000000007c22f600 GPR04: 000000000022d000 0000000000002600 c0000003b2d56360 c0000003915d7960 GPR08: c0000003915d7cd0 0000000000000002 0000000000002600 c000000000521cc0 GPR12: 0000000024004284 c00000000fd80a00 000000004b04ae64 ffffffffffffffff GPR16: 000000001000ca70 0000000000000000 c0000003b2d56380 c00000000153d2b8 GPR20: 0000000000000010 c0000003bc87bac8 0000000000223000 000000000022f5ff GPR24: c0000003b2d56360 000000000000000c 0000000000002600 000000000022d000 GPR28: 0000000000000000 c0000003915d7960 c0000003b2d56360 00000000000001ff NIP [c00000000035a72c] .iomap_dio_actor+0xcc/0x420 LR [c00000000035a6f4] .iomap_dio_actor+0x94/0x420 Call Trace: [c0000003915d77f0] [c00000000035a6f4] .iomap_dio_actor+0x94/0x420 (unreliable) [c0000003915d78f0] [c00000000035b9f4] .iomap_apply+0xf4/0x1f0 [c0000003915d79d0] [c00000000035c320] .iomap_dio_rw+0x230/0x420 [c0000003915d7ae0] [c000000000512a14] .xfs_file_dio_aio_read+0x84/0x160 [c0000003915d7b80] [c000000000512d24] .xfs_file_read_iter+0x104/0x130 [c0000003915d7c10] [c0000000002d6234] .__vfs_read+0x114/0x1a0 [c0000003915d7cf0] [c0000000002d7a8c] .vfs_read+0xac/0x1a0 [c0000003915d7d90] [c0000000002d96b8] .SyS_read+0x58/0x100 [c0000003915d7e30] [c00000000000b8e0] system_call+0x38/0xfc Instruction dump: 78630020 7f831b78 7ffc07b4 7c7ce039 40820360 a13d0018 2f890003 419e0288 2f890004 419e00a0 2f890001 419e02a8 <0fe00000> 3b80fffb 38210100 7f83e378 The above problem can also be recreated on a regular xfs filesystem using the command, $ fsstress -d /mnt -l 1000 -n 1000 -p 1000 The reason for the call trace is, 1. When 'reserving' blocks for delayed allocation , XFS reserves more blocks (i.e. past file's current EOF) than required. This is done because XFS assumes that userspace might write more data and hence 'reserving' more blocks might lead to the file's new data being stored contiguously on disk. 2. The in-memory 'struct xfs_bmbt_irec' mapping the file's last extent would then cover the prealloc-ed EOF blocks in addition to the regular blocks. 3. When flushing the dirty blocks to disk, we only flush data till the file's EOF. But before writing out the dirty data, we allocate blocks on the disk for holding the file's new data. This allocation includes the blocks that are part of the 'prealloc EOF blocks'. 4. Later, when the last reference to the inode is being closed, XFS frees the unused 'prealloc EOF blocks' in xfs_inactive(). In step 3 above, When allocating space on disk for the delayed allocation range, the space allocator might sometimes allocate less blocks than required. If such an allocation ends right at the current EOF of the file, We will not be able to clear the "delayed allocation" flag for the 'prealloc EOF blocks', since we won't have dirty buffer heads associated with that range of the file. In such a situation if a Direct I/O read operation is performed on file range [X, Y] (where X < EOF and Y > EOF), we flush dirty data in the range [X, Y] and invalidate page cache for that range (Refer to iomap_dio_rw()). Later for performing the Direct I/O read, XFS obtains the extent items (which are still cached in memory) for the file range. When doing so we are not supposed to get an extent item with IOMAP_DELALLOC flag set, since the previous "flush" operation should have converted any delayed allocation data in the range [X, Y]. Hence we end up hitting a WARN_ON_ONCE(1) statement in iomap_dio_actor(). This commit fixes the bug by preventing the read operation from going beyond iomap_dio->i_size. Signed-off-by: Chandan Rajendra <[email protected]>

On a ppc64 machine executing overlayfs/019 with xfs as the lower and upper filesystem causes the following call trace, WARNING: CPU: 2 PID: 8034 at /root/repos/linux/fs/iomap.c:765 .iomap_dio_actor+0xcc/0x420 Modules linked in: CPU: 2 PID: 8034 Comm: fsstress Tainted: G L 4.11.0-rc5-next-20170405 torvalds#100 task: c000000631314880 task.stack: c0000003915d4000 NIP: c00000000035a72c LR: c00000000035a6f4 CTR: c00000000035a660 REGS: c0000003915d7570 TRAP: 0700 Tainted: G L (4.11.0-rc5-next-20170405) MSR: 800000000282b032 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI> CR: 24004284 XER: 00000000 CFAR: c0000000006f7190 SOFTE: 1 GPR00: c00000000035a6f4 c0000003915d77f0 c0000000015a3f00 000000007c22f600 GPR04: 000000000022d000 0000000000002600 c0000003b2d56360 c0000003915d7960 GPR08: c0000003915d7cd0 0000000000000002 0000000000002600 c000000000521cc0 GPR12: 0000000024004284 c00000000fd80a00 000000004b04ae64 ffffffffffffffff GPR16: 000000001000ca70 0000000000000000 c0000003b2d56380 c00000000153d2b8 GPR20: 0000000000000010 c0000003bc87bac8 0000000000223000 000000000022f5ff GPR24: c0000003b2d56360 000000000000000c 0000000000002600 000000000022d000 GPR28: 0000000000000000 c0000003915d7960 c0000003b2d56360 00000000000001ff NIP [c00000000035a72c] .iomap_dio_actor+0xcc/0x420 LR [c00000000035a6f4] .iomap_dio_actor+0x94/0x420 Call Trace: [c0000003915d77f0] [c00000000035a6f4] .iomap_dio_actor+0x94/0x420 (unreliable) [c0000003915d78f0] [c00000000035b9f4] .iomap_apply+0xf4/0x1f0 [c0000003915d79d0] [c00000000035c320] .iomap_dio_rw+0x230/0x420 [c0000003915d7ae0] [c000000000512a14] .xfs_file_dio_aio_read+0x84/0x160 [c0000003915d7b80] [c000000000512d24] .xfs_file_read_iter+0x104/0x130 [c0000003915d7c10] [c0000000002d6234] .__vfs_read+0x114/0x1a0 [c0000003915d7cf0] [c0000000002d7a8c] .vfs_read+0xac/0x1a0 [c0000003915d7d90] [c0000000002d96b8] .SyS_read+0x58/0x100 [c0000003915d7e30] [c00000000000b8e0] system_call+0x38/0xfc Instruction dump: 78630020 7f831b78 7ffc07b4 7c7ce039 40820360 a13d0018 2f890003 419e0288 2f890004 419e00a0 2f890001 419e02a8 <0fe00000> 3b80fffb 38210100 7f83e378 The above problem can also be recreated on a regular xfs filesystem using the command, $ fsstress -d /mnt -l 1000 -n 1000 -p 1000 The reason for the call trace is, 1. When 'reserving' blocks for delayed allocation , XFS reserves more blocks (i.e. past file's current EOF) than required. This is done because XFS assumes that userspace might write more data and hence 'reserving' more blocks might lead to the file's new data being stored contiguously on disk. 2. The in-memory 'struct xfs_bmbt_irec' mapping the file's last extent would then cover the prealloc-ed EOF blocks in addition to the regular blocks. 3. When flushing the dirty blocks to disk, we only flush data till the file's EOF. But before writing out the dirty data, we allocate blocks on the disk for holding the file's new data. This allocation includes the blocks that are part of the 'prealloc EOF blocks'. 4. Later, when the last reference to the inode is being closed, XFS frees the unused 'prealloc EOF blocks' in xfs_inactive(). In step 3 above, When allocating space on disk for the delayed allocation range, the space allocator might sometimes allocate less blocks than required. If such an allocation ends right at the current EOF of the file, We will not be able to clear the "delayed allocation" flag for the 'prealloc EOF blocks', since we won't have dirty buffer heads associated with that range of the file. In such a situation if a Direct I/O read operation is performed on file range [X, Y] (where X < EOF and Y > EOF), we flush dirty data in the range [X, Y] and invalidate page cache for that range (Refer to iomap_dio_rw()). Later for performing the Direct I/O read, XFS obtains the extent items (which are still cached in memory) for the file range. When doing so we are not supposed to get an extent item with IOMAP_DELALLOC flag set, since the previous "flush" operation should have converted any delayed allocation data in the range [X, Y]. Hence we end up hitting a WARN_ON_ONCE(1) statement in iomap_dio_actor(). This commit fixes the bug by preventing the read operation from going beyond iomap_dio->i_size. Reported-by: Santhosh G <[email protected]> Signed-off-by: Chandan Rajendra <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]>

On a ppc64 machine executing overlayfs/019 with xfs as the lower and upper filesystem causes the following call trace, WARNING: CPU: 2 PID: 8034 at /root/repos/linux/fs/iomap.c:765 .iomap_dio_actor+0xcc/0x420 Modules linked in: CPU: 2 PID: 8034 Comm: fsstress Tainted: G L 4.11.0-rc5-next-20170405 torvalds#100 task: c000000631314880 task.stack: c0000003915d4000 NIP: c00000000035a72c LR: c00000000035a6f4 CTR: c00000000035a660 REGS: c0000003915d7570 TRAP: 0700 Tainted: G L (4.11.0-rc5-next-20170405) MSR: 800000000282b032 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI> CR: 24004284 XER: 00000000 CFAR: c0000000006f7190 SOFTE: 1 GPR00: c00000000035a6f4 c0000003915d77f0 c0000000015a3f00 000000007c22f600 GPR04: 000000000022d000 0000000000002600 c0000003b2d56360 c0000003915d7960 GPR08: c0000003915d7cd0 0000000000000002 0000000000002600 c000000000521cc0 GPR12: 0000000024004284 c00000000fd80a00 000000004b04ae64 ffffffffffffffff GPR16: 000000001000ca70 0000000000000000 c0000003b2d56380 c00000000153d2b8 GPR20: 0000000000000010 c0000003bc87bac8 0000000000223000 000000000022f5ff GPR24: c0000003b2d56360 000000000000000c 0000000000002600 000000000022d000 GPR28: 0000000000000000 c0000003915d7960 c0000003b2d56360 00000000000001ff NIP [c00000000035a72c] .iomap_dio_actor+0xcc/0x420 LR [c00000000035a6f4] .iomap_dio_actor+0x94/0x420 Call Trace: [c0000003915d77f0] [c00000000035a6f4] .iomap_dio_actor+0x94/0x420 (unreliable) [c0000003915d78f0] [c00000000035b9f4] .iomap_apply+0xf4/0x1f0 [c0000003915d79d0] [c00000000035c320] .iomap_dio_rw+0x230/0x420 [c0000003915d7ae0] [c000000000512a14] .xfs_file_dio_aio_read+0x84/0x160 [c0000003915d7b80] [c000000000512d24] .xfs_file_read_iter+0x104/0x130 [c0000003915d7c10] [c0000000002d6234] .__vfs_read+0x114/0x1a0 [c0000003915d7cf0] [c0000000002d7a8c] .vfs_read+0xac/0x1a0 [c0000003915d7d90] [c0000000002d96b8] .SyS_read+0x58/0x100 [c0000003915d7e30] [c00000000000b8e0] system_call+0x38/0xfc Instruction dump: 78630020 7f831b78 7ffc07b4 7c7ce039 40820360 a13d0018 2f890003 419e0288 2f890004 419e00a0 2f890001 419e02a8 <0fe00000> 3b80fffb 38210100 7f83e378 The above problem can also be recreated on a regular xfs filesystem using the command, $ fsstress -d /mnt -l 1000 -n 1000 -p 1000 The reason for the call trace is, 1. When 'reserving' blocks for delayed allocation , XFS reserves more blocks (i.e. past file's current EOF) than required. This is done because XFS assumes that userspace might write more data and hence 'reserving' more blocks might lead to the file's new data being stored contiguously on disk. 2. The in-memory 'struct xfs_bmbt_irec' mapping the file's last extent would then cover the prealloc-ed EOF blocks in addition to the regular blocks. 3. When flushing the dirty blocks to disk, we only flush data till the file's EOF. But before writing out the dirty data, we allocate blocks on the disk for holding the file's new data. This allocation includes the blocks that are part of the 'prealloc EOF blocks'. 4. Later, when the last reference to the inode is being closed, XFS frees the unused 'prealloc EOF blocks' in xfs_inactive(). In step 3 above, When allocating space on disk for the delayed allocation range, the space allocator might sometimes allocate less blocks than required. If such an allocation ends right at the current EOF of the file, We will not be able to clear the "delayed allocation" flag for the 'prealloc EOF blocks', since we won't have dirty buffer heads associated with that range of the file. In such a situation if a Direct I/O read operation is performed on file range [X, Y] (where X < EOF and Y > EOF), we flush dirty data in the range [X, Y] and invalidate page cache for that range (Refer to iomap_dio_rw()). Later for performing the Direct I/O read, XFS obtains the extent items (which are still cached in memory) for the file range. When doing so we are not supposed to get an extent item with IOMAP_DELALLOC flag set, since the previous "flush" operation should have converted any delayed allocation data in the range [X, Y]. Hence we end up hitting a WARN_ON_ONCE(1) statement in iomap_dio_actor(). This commit fixes the bug by preventing the read operation from going beyond iomap_dio->i_size. Reported-by: Santhosh G <[email protected]> Signed-off-by: Chandan Rajendra <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>

BugLink: http://bugs.launchpad.net/bugs/1697955 commit a008c31 upstream. On a ppc64 machine executing overlayfs/019 with xfs as the lower and upper filesystem causes the following call trace, WARNING: CPU: 2 PID: 8034 at /root/repos/linux/fs/iomap.c:765 .iomap_dio_actor+0xcc/0x420 Modules linked in: CPU: 2 PID: 8034 Comm: fsstress Tainted: G L 4.11.0-rc5-next-20170405 torvalds#100 task: c000000631314880 task.stack: c0000003915d4000 NIP: c00000000035a72c LR: c00000000035a6f4 CTR: c00000000035a660 REGS: c0000003915d7570 TRAP: 0700 Tainted: G L (4.11.0-rc5-next-20170405) MSR: 800000000282b032 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI> CR: 24004284 XER: 00000000 CFAR: c0000000006f7190 SOFTE: 1 GPR00: c00000000035a6f4 c0000003915d77f0 c0000000015a3f00 000000007c22f600 GPR04: 000000000022d000 0000000000002600 c0000003b2d56360 c0000003915d7960 GPR08: c0000003915d7cd0 0000000000000002 0000000000002600 c000000000521cc0 GPR12: 0000000024004284 c00000000fd80a00 000000004b04ae64 ffffffffffffffff GPR16: 000000001000ca70 0000000000000000 c0000003b2d56380 c00000000153d2b8 GPR20: 0000000000000010 c0000003bc87bac8 0000000000223000 000000000022f5ff GPR24: c0000003b2d56360 000000000000000c 0000000000002600 000000000022d000 GPR28: 0000000000000000 c0000003915d7960 c0000003b2d56360 00000000000001ff NIP [c00000000035a72c] .iomap_dio_actor+0xcc/0x420 LR [c00000000035a6f4] .iomap_dio_actor+0x94/0x420 Call Trace: [c0000003915d77f0] [c00000000035a6f4] .iomap_dio_actor+0x94/0x420 (unreliable) [c0000003915d78f0] [c00000000035b9f4] .iomap_apply+0xf4/0x1f0 [c0000003915d79d0] [c00000000035c320] .iomap_dio_rw+0x230/0x420 [c0000003915d7ae0] [c000000000512a14] .xfs_file_dio_aio_read+0x84/0x160 [c0000003915d7b80] [c000000000512d24] .xfs_file_read_iter+0x104/0x130 [c0000003915d7c10] [c0000000002d6234] .__vfs_read+0x114/0x1a0 [c0000003915d7cf0] [c0000000002d7a8c] .vfs_read+0xac/0x1a0 [c0000003915d7d90] [c0000000002d96b8] .SyS_read+0x58/0x100 [c0000003915d7e30] [c00000000000b8e0] system_call+0x38/0xfc Instruction dump: 78630020 7f831b78 7ffc07b4 7c7ce039 40820360 a13d0018 2f890003 419e0288 2f890004 419e00a0 2f890001 419e02a8 <0fe00000> 3b80fffb 38210100 7f83e378 The above problem can also be recreated on a regular xfs filesystem using the command, $ fsstress -d /mnt -l 1000 -n 1000 -p 1000 The reason for the call trace is, 1. When 'reserving' blocks for delayed allocation , XFS reserves more blocks (i.e. past file's current EOF) than required. This is done because XFS assumes that userspace might write more data and hence 'reserving' more blocks might lead to the file's new data being stored contiguously on disk. 2. The in-memory 'struct xfs_bmbt_irec' mapping the file's last extent would then cover the prealloc-ed EOF blocks in addition to the regular blocks. 3. When flushing the dirty blocks to disk, we only flush data till the file's EOF. But before writing out the dirty data, we allocate blocks on the disk for holding the file's new data. This allocation includes the blocks that are part of the 'prealloc EOF blocks'. 4. Later, when the last reference to the inode is being closed, XFS frees the unused 'prealloc EOF blocks' in xfs_inactive(). In step 3 above, When allocating space on disk for the delayed allocation range, the space allocator might sometimes allocate less blocks than required. If such an allocation ends right at the current EOF of the file, We will not be able to clear the "delayed allocation" flag for the 'prealloc EOF blocks', since we won't have dirty buffer heads associated with that range of the file. In such a situation if a Direct I/O read operation is performed on file range [X, Y] (where X < EOF and Y > EOF), we flush dirty data in the range [X, Y] and invalidate page cache for that range (Refer to iomap_dio_rw()). Later for performing the Direct I/O read, XFS obtains the extent items (which are still cached in memory) for the file range. When doing so we are not supposed to get an extent item with IOMAP_DELALLOC flag set, since the previous "flush" operation should have converted any delayed allocation data in the range [X, Y]. Hence we end up hitting a WARN_ON_ONCE(1) statement in iomap_dio_actor(). This commit fixes the bug by preventing the read operation from going beyond iomap_dio->i_size. Reported-by: Santhosh G <[email protected]> Signed-off-by: Chandan Rajendra <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> Signed-off-by: Seth Forshee <[email protected]>

Rebase patches onto 4.13.8

ERROR: space prohibited after that '-' (ctx:WxW) torvalds#87: FILE: fs/hfs/hfs_fs.h:265: + ut -= - sys_tz.tz_minuteswest * 60; ^ WARNING: line over 80 characters torvalds#100: FILE: fs/hfs/hfs_fs.h:276: +#define hfs_m_to_utime(time) (struct timespec){ .tv_sec = __hfs_m_to_utime(time) } total: 1 errors, 1 warnings, 71 lines checked NOTE: For some of the reported defects, checkpatch may be able to mechanically convert to the typical style using --fix or --fix-inplace. ./patches/hfs-hfsplus-follow-macos-time-behavior.patch has style problems, please review. NOTE: If any of the errors are false positives, please report them to the maintainer, see CHECKPATCH in MAINTAINERS. Please run checkpatch prior to sending patches Cc: Arnd Bergmann <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Stephen Rothwell <[email protected]>

These NEON and non-NEON implementations come from Andy Polyakov's implementation. They are exactly the same as Andy Polyakov's original, with the following exceptions: - Entries and exits use the proper kernel convention macro. - CPU feature checking is done in C by the glue code, so that has been removed from the assembly. - The function names have been renamed to fit kernel conventions. - Labels have been renamed (prefixed with .L) to fit kernel conventions. - Constants have been rearranged so that they are closer to the code that is using them. [ARM only] - The neon code can jump to the scalar code when it makes sense to do so. - The neon_512 function as a separate function has been removed, leaving the decision up to the main neon entry point. [ARM64 only] After '/^#/d;/^\..*[^:]$/d', the code has the following diff in actual instructions from the original. ARM: -ChaCha20_ctr32: -.LChaCha20_ctr32: +ENTRY(chacha20_arm) ldr r12,[sp,#0] @ pull pointer to counter and nonce stmdb sp!,{r0-r2,r4-r11,lr} - sub r14,pc,torvalds#16 @ ChaCha20_ctr32 - adr r14,.LChaCha20_ctr32 cmp r2,#0 @ len==0? itt eq addeq sp,sp,#4*3 - beq .Lno_data - cmp r2,torvalds#192 @ test len - bls .Lshort - ldr r4,[r14,#-32] - ldr r4,[r14,r4] - ldr r4,[r4] - tst r4,#ARMV7_NEON - bne .LChaCha20_neon + beq .Lno_data_arm .Lshort: ldmia r12,{r4-r7} @ load counter and nonce sub sp,sp,#4*(16) @ off-load area - sub r14,r14,torvalds#64 @ .Lsigma + sub r14,pc,torvalds#100 @ .Lsigma + adr r14,.Lsigma @ .Lsigma stmdb sp!,{r4-r7} @ copy counter and nonce ldmia r3,{r4-r11} @ load key ldmia r14,{r0-r3} @ load sigma @@ -617,14 +615,25 @@ .Ldone: add sp,sp,#4*(32+3) -.Lno_data: +.Lno_data_arm: ldmia sp!,{r4-r11,pc} +ENDPROC(chacha20_arm) -ChaCha20_neon: +ENTRY(chacha20_neon) ldr r12,[sp,#0] @ pull pointer to counter and nonce stmdb sp!,{r0-r2,r4-r11,lr} -.LChaCha20_neon: - adr r14,.Lsigma + cmp r2,#0 @ len==0? + itt eq + addeq sp,sp,#4*3 + beq .Lno_data_neon + cmp r2,torvalds#192 @ test len + bls .Lshort +.Lchacha20_neon_begin: + adr r14,.Lsigma2 vstmdb sp!,{d8-d15} @ ABI spec says so stmdb sp!,{r0-r3} @@ -1265,4 +1274,6 @@ add sp,sp,#4*(32+4) vldmia sp,{d8-d15} add sp,sp,#4*(16+3) +.Lno_data_neon: ldmia sp!,{r4-r11,pc} +ENDPROC(chacha20_neon) ARM64: -ChaCha20_ctr32: +ENTRY(chacha20_arm) cbz x2,.Labort - adr x5,.LOPENSSL_armcap_P - cmp x2,torvalds#192 - b.lo .Lshort - ldrsw x6,[x5] - ldr x6,[x5] - ldr w17,[x6,x5] - tst w17,#ARMV7_NEON - b.ne ChaCha20_neon - .Lshort: stp x29,x30,[sp,#-96]! add x29,sp,#0 @@ -279,8 +274,13 @@ ldp x27,x28,[x29,torvalds#80] ldp x29,x30,[sp],torvalds#96 ret +ENDPROC(chacha20_arm) + +ENTRY(chacha20_neon) + cbz x2,.Labort_neon + cmp x2,torvalds#192 + b.lo .Lshort -ChaCha20_neon: stp x29,x30,[sp,#-96]! add x29,sp,#0 @@ -763,16 +763,6 @@ ldp x27,x28,[x29,torvalds#80] ldp x29,x30,[sp],torvalds#96 ret -ChaCha20_512_neon: - stp x29,x30,[sp,#-96]! - add x29,sp,#0 - - adr x5,.Lsigma - stp x19,x20,[sp,torvalds#16] - stp x21,x22,[sp,torvalds#32] - stp x23,x24,[sp,torvalds#48] - stp x25,x26,[sp,torvalds#64] - stp x27,x28,[sp,torvalds#80] .L512_or_more_neon: sub sp,sp,torvalds#128+64 @@ -1920,4 +1910,6 @@ ldp x25,x26,[x29,torvalds#64] ldp x27,x28,[x29,torvalds#80] ldp x29,x30,[sp],torvalds#96 +.Labort_neon: ret +ENDPROC(chacha20_neon) Signed-off-by: Jason A. Donenfeld <[email protected]> Cc: Samuel Neves <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Greg KH <[email protected]> Cc: Jean-Philippe Aumasson <[email protected]> Cc: Andy Polyakov <[email protected]> Cc: Russell King <[email protected]> Cc: [email protected]

If we drop the engine lock, we may run execlists_dequeue which may free the priolist. Therefore if we ever drop the execution lock on the engine, we have to discard our cache and refetch the priolist to ensure we do not use a stale pointer. [ 506.418935] [IGT] gem_exec_whisper: starting subtest contexts-priority [ 593.240825] general protection fault: 0000 [#1] SMP [ 593.240863] CPU: 1 PID: 494 Comm: gem_exec_whispe Tainted: G U 5.0.0-rc6+ torvalds#100 [ 593.240879] Hardware name: /NUC6CAYB, BIOS AYAPLCEL.86A.0029.2016.1124.1625 11/24/2016 [ 593.240965] RIP: 0010:__i915_schedule+0x1fe/0x320 [i915] [ 593.240981] Code: 48 8b 0c 24 48 89 c3 49 8b 45 28 49 8b 75 20 4c 89 3c 24 48 89 46 08 48 89 30 48 8b 43 08 48 89 4b 08 49 89 5d 20 49 89 45 28 <48> 89 08 45 39 a7 b8 03 00 00 7d 44 45 89 a7 b8 03 00 00 49 8b 85 [ 593.240999] RSP: 0018:ffffc90000057a60 EFLAGS: 00010046 [ 593.241013] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8882582d7870 RCX: ffff88826baba6f0 [ 593.241026] RDX: 0000000000000000 RSI: ffff8882582d6e70 RDI: ffff888273482194 [ 593.241049] RBP: ffffc90000057a68 R08: ffff8882582d7680 R09: ffff8882582d7840 [ 593.241068] R10: 0000000000000000 R11: ffffea00095ebe08 R12: 0000000000000728 [ 593.241105] R13: ffff88826baba6d0 R14: ffffc90000057a40 R15: ffff888273482158 [ 593.241120] FS: 00007f4613fb3900(0000) GS:ffff888277a80000(0000) knlGS:0000000000000000 [ 593.241133] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 593.241146] CR2: 00007f57d3c66a84 CR3: 000000026e2b6000 CR4: 00000000001406e0 [ 593.241158] Call Trace: [ 593.241233] i915_schedule+0x1f/0x30 [i915] [ 593.241326] i915_request_add+0x1a9/0x290 [i915] [ 593.241393] i915_gem_do_execbuffer+0x45f/0x1150 [i915] [ 593.241411] ? init_object+0x49/0x80 [ 593.241425] ? ___slab_alloc.constprop.91+0x4b8/0x4e0 [ 593.241491] ? i915_gem_execbuffer2_ioctl+0x99/0x380 [i915] [ 593.241563] ? i915_gem_execbuffer_ioctl+0x270/0x270 [i915] [ 593.241629] i915_gem_execbuffer2_ioctl+0x1bb/0x380 [i915] [ 593.241705] ? i915_gem_execbuffer_ioctl+0x270/0x270 [i915] [ 593.241724] drm_ioctl_kernel+0x81/0xd0 [ 593.241738] drm_ioctl+0x1a7/0x310 [ 593.241803] ? i915_gem_execbuffer_ioctl+0x270/0x270 [i915] [ 593.241819] ? __update_load_avg_se+0x1c9/0x240 [ 593.241834] ? pick_next_entity+0x7e/0x120 [ 593.241851] do_vfs_ioctl+0x88/0x5d0 [ 593.241880] ksys_ioctl+0x35/0x70 [ 593.241894] __x64_sys_ioctl+0x11/0x20 [ 593.241907] do_syscall_64+0x44/0xf0 [ 593.241924] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 593.241940] RIP: 0033:0x7f4615ffe757 [ 593.241952] Code: 00 00 90 48 8b 05 39 a7 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 09 a7 0c 00 f7 d8 64 89 01 48 [ 593.241970] RSP: 002b:00007ffc1030ddf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ 593.241984] RAX: ffffffffffffffda RBX: 00007ffc10324420 RCX: 00007f4615ffe757 [ 593.241997] RDX: 00007ffc1030e220 RSI: 0000000040406469 RDI: 0000000000000003 [ 593.242010] RBP: 00007ffc1030e220 R08: 00007f46160c9208 R09: 00007f46160c9240 [ 593.242022] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000040406469 [ 593.242038] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000 [ 593.242058] Modules linked in: i915 intel_gtt drm_kms_helper prime_numbers v2: Track the local engine cache and explicitly clear it when switching engine locks. Fixes: a02eb97 ("drm/i915/execlists: Cache the priolist when rescheduling") Testcase: igt/gem_exec_whisper/contexts-priority # rare! Signed-off-by: Chris Wilson <[email protected]> Cc: Joonas Lahtinen <[email protected]> Cc: Tvrtko Ursulin <[email protected]> Cc: Michał Winiarski <[email protected]>

If we drop the engine lock, we may run execlists_dequeue which may free the priolist. Therefore if we ever drop the execution lock on the engine, we have to discard our cache and refetch the priolist to ensure we do not use a stale pointer. [ 506.418935] [IGT] gem_exec_whisper: starting subtest contexts-priority [ 593.240825] general protection fault: 0000 [#1] SMP [ 593.240863] CPU: 1 PID: 494 Comm: gem_exec_whispe Tainted: G U 5.0.0-rc6+ torvalds#100 [ 593.240879] Hardware name: /NUC6CAYB, BIOS AYAPLCEL.86A.0029.2016.1124.1625 11/24/2016 [ 593.240965] RIP: 0010:__i915_schedule+0x1fe/0x320 [i915] [ 593.240981] Code: 48 8b 0c 24 48 89 c3 49 8b 45 28 49 8b 75 20 4c 89 3c 24 48 89 46 08 48 89 30 48 8b 43 08 48 89 4b 08 49 89 5d 20 49 89 45 28 <48> 89 08 45 39 a7 b8 03 00 00 7d 44 45 89 a7 b8 03 00 00 49 8b 85 [ 593.240999] RSP: 0018:ffffc90000057a60 EFLAGS: 00010046 [ 593.241013] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8882582d7870 RCX: ffff88826baba6f0 [ 593.241026] RDX: 0000000000000000 RSI: ffff8882582d6e70 RDI: ffff888273482194 [ 593.241049] RBP: ffffc90000057a68 R08: ffff8882582d7680 R09: ffff8882582d7840 [ 593.241068] R10: 0000000000000000 R11: ffffea00095ebe08 R12: 0000000000000728 [ 593.241105] R13: ffff88826baba6d0 R14: ffffc90000057a40 R15: ffff888273482158 [ 593.241120] FS: 00007f4613fb3900(0000) GS:ffff888277a80000(0000) knlGS:0000000000000000 [ 593.241133] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 593.241146] CR2: 00007f57d3c66a84 CR3: 000000026e2b6000 CR4: 00000000001406e0 [ 593.241158] Call Trace: [ 593.241233] i915_schedule+0x1f/0x30 [i915] [ 593.241326] i915_request_add+0x1a9/0x290 [i915] [ 593.241393] i915_gem_do_execbuffer+0x45f/0x1150 [i915] [ 593.241411] ? init_object+0x49/0x80 [ 593.241425] ? ___slab_alloc.constprop.91+0x4b8/0x4e0 [ 593.241491] ? i915_gem_execbuffer2_ioctl+0x99/0x380 [i915] [ 593.241563] ? i915_gem_execbuffer_ioctl+0x270/0x270 [i915] [ 593.241629] i915_gem_execbuffer2_ioctl+0x1bb/0x380 [i915] [ 593.241705] ? i915_gem_execbuffer_ioctl+0x270/0x270 [i915] [ 593.241724] drm_ioctl_kernel+0x81/0xd0 [ 593.241738] drm_ioctl+0x1a7/0x310 [ 593.241803] ? i915_gem_execbuffer_ioctl+0x270/0x270 [i915] [ 593.241819] ? __update_load_avg_se+0x1c9/0x240 [ 593.241834] ? pick_next_entity+0x7e/0x120 [ 593.241851] do_vfs_ioctl+0x88/0x5d0 [ 593.241880] ksys_ioctl+0x35/0x70 [ 593.241894] __x64_sys_ioctl+0x11/0x20 [ 593.241907] do_syscall_64+0x44/0xf0 [ 593.241924] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 593.241940] RIP: 0033:0x7f4615ffe757 [ 593.241952] Code: 00 00 90 48 8b 05 39 a7 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 09 a7 0c 00 f7 d8 64 89 01 48 [ 593.241970] RSP: 002b:00007ffc1030ddf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ 593.241984] RAX: ffffffffffffffda RBX: 00007ffc10324420 RCX: 00007f4615ffe757 [ 593.241997] RDX: 00007ffc1030e220 RSI: 0000000040406469 RDI: 0000000000000003 [ 593.242010] RBP: 00007ffc1030e220 R08: 00007f46160c9208 R09: 00007f46160c9240 [ 593.242022] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000040406469 [ 593.242038] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000 [ 593.242058] Modules linked in: i915 intel_gtt drm_kms_helper prime_numbers v2: Track the local engine cache and explicitly clear it when switching engine locks. Fixes: a02eb97 ("drm/i915/execlists: Cache the priolist when rescheduling") Testcase: igt/gem_exec_whisper/contexts-priority # rare! Signed-off-by: Chris Wilson <[email protected]> Cc: Joonas Lahtinen <[email protected]> Cc: Tvrtko Ursulin <[email protected]> Cc: Michał Winiarski <[email protected]> Reviewed-by: Tvrtko Ursulin <[email protected]>

If we drop the engine lock, we may run execlists_dequeue which may free the priolist. Therefore if we ever drop the execution lock on the engine, we have to discard our cache and refetch the priolist to ensure we do not use a stale pointer. [ 506.418935] [IGT] gem_exec_whisper: starting subtest contexts-priority [ 593.240825] general protection fault: 0000 [#1] SMP [ 593.240863] CPU: 1 PID: 494 Comm: gem_exec_whispe Tainted: G U 5.0.0-rc6+ torvalds#100 [ 593.240879] Hardware name: /NUC6CAYB, BIOS AYAPLCEL.86A.0029.2016.1124.1625 11/24/2016 [ 593.240965] RIP: 0010:__i915_schedule+0x1fe/0x320 [i915] [ 593.240981] Code: 48 8b 0c 24 48 89 c3 49 8b 45 28 49 8b 75 20 4c 89 3c 24 48 89 46 08 48 89 30 48 8b 43 08 48 89 4b 08 49 89 5d 20 49 89 45 28 <48> 89 08 45 39 a7 b8 03 00 00 7d 44 45 89 a7 b8 03 00 00 49 8b 85 [ 593.240999] RSP: 0018:ffffc90000057a60 EFLAGS: 00010046 [ 593.241013] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8882582d7870 RCX: ffff88826baba6f0 [ 593.241026] RDX: 0000000000000000 RSI: ffff8882582d6e70 RDI: ffff888273482194 [ 593.241049] RBP: ffffc90000057a68 R08: ffff8882582d7680 R09: ffff8882582d7840 [ 593.241068] R10: 0000000000000000 R11: ffffea00095ebe08 R12: 0000000000000728 [ 593.241105] R13: ffff88826baba6d0 R14: ffffc90000057a40 R15: ffff888273482158 [ 593.241120] FS: 00007f4613fb3900(0000) GS:ffff888277a80000(0000) knlGS:0000000000000000 [ 593.241133] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 593.241146] CR2: 00007f57d3c66a84 CR3: 000000026e2b6000 CR4: 00000000001406e0 [ 593.241158] Call Trace: [ 593.241233] i915_schedule+0x1f/0x30 [i915] [ 593.241326] i915_request_add+0x1a9/0x290 [i915] [ 593.241393] i915_gem_do_execbuffer+0x45f/0x1150 [i915] [ 593.241411] ? init_object+0x49/0x80 [ 593.241425] ? ___slab_alloc.constprop.91+0x4b8/0x4e0 [ 593.241491] ? i915_gem_execbuffer2_ioctl+0x99/0x380 [i915] [ 593.241563] ? i915_gem_execbuffer_ioctl+0x270/0x270 [i915] [ 593.241629] i915_gem_execbuffer2_ioctl+0x1bb/0x380 [i915] [ 593.241705] ? i915_gem_execbuffer_ioctl+0x270/0x270 [i915] [ 593.241724] drm_ioctl_kernel+0x81/0xd0 [ 593.241738] drm_ioctl+0x1a7/0x310 [ 593.241803] ? i915_gem_execbuffer_ioctl+0x270/0x270 [i915] [ 593.241819] ? __update_load_avg_se+0x1c9/0x240 [ 593.241834] ? pick_next_entity+0x7e/0x120 [ 593.241851] do_vfs_ioctl+0x88/0x5d0 [ 593.241880] ksys_ioctl+0x35/0x70 [ 593.241894] __x64_sys_ioctl+0x11/0x20 [ 593.241907] do_syscall_64+0x44/0xf0 [ 593.241924] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 593.241940] RIP: 0033:0x7f4615ffe757 [ 593.241952] Code: 00 00 90 48 8b 05 39 a7 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 09 a7 0c 00 f7 d8 64 89 01 48 [ 593.241970] RSP: 002b:00007ffc1030ddf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ 593.241984] RAX: ffffffffffffffda RBX: 00007ffc10324420 RCX: 00007f4615ffe757 [ 593.241997] RDX: 00007ffc1030e220 RSI: 0000000040406469 RDI: 0000000000000003 [ 593.242010] RBP: 00007ffc1030e220 R08: 00007f46160c9208 R09: 00007f46160c9240 [ 593.242022] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000040406469 [ 593.242038] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000 [ 593.242058] Modules linked in: i915 intel_gtt drm_kms_helper prime_numbers Fixes: a02eb97 ("drm/i915/execlists: Cache the priolist when rescheduling") Testcase: igt/gem_exec_whisper/contexts-priority # rare! Signed-off-by: Chris Wilson <[email protected]> Cc: Joonas Lahtinen <[email protected]> Cc: Tvrtko Ursulin <[email protected]> Cc: Michał Winiarski <[email protected]>

If we drop the engine lock, we may run execlists_dequeue which may free the priolist. Therefore if we ever drop the execution lock on the engine, we have to discard our cache and refetch the priolist to ensure we do not use a stale pointer. [ 506.418935] [IGT] gem_exec_whisper: starting subtest contexts-priority [ 593.240825] general protection fault: 0000 [rib#1] SMP [ 593.240863] CPU: 1 PID: 494 Comm: gem_exec_whispe Tainted: G U 5.0.0-rc6+ torvalds#100 [ 593.240879] Hardware name: /NUC6CAYB, BIOS AYAPLCEL.86A.0029.2016.1124.1625 11/24/2016 [ 593.240965] RIP: 0010:__i915_schedule+0x1fe/0x320 [i915] [ 593.240981] Code: 48 8b 0c 24 48 89 c3 49 8b 45 28 49 8b 75 20 4c 89 3c 24 48 89 46 08 48 89 30 48 8b 43 08 48 89 4b 08 49 89 5d 20 49 89 45 28 <48> 89 08 45 39 a7 b8 03 00 00 7d 44 45 89 a7 b8 03 00 00 49 8b 85 [ 593.240999] RSP: 0018:ffffc90000057a60 EFLAGS: 00010046 [ 593.241013] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8882582d7870 RCX: ffff88826baba6f0 [ 593.241026] RDX: 0000000000000000 RSI: ffff8882582d6e70 RDI: ffff888273482194 [ 593.241049] RBP: ffffc90000057a68 R08: ffff8882582d7680 R09: ffff8882582d7840 [ 593.241068] R10: 0000000000000000 R11: ffffea00095ebe08 R12: 0000000000000728 [ 593.241105] R13: ffff88826baba6d0 R14: ffffc90000057a40 R15: ffff888273482158 [ 593.241120] FS: 00007f4613fb3900(0000) GS:ffff888277a80000(0000) knlGS:0000000000000000 [ 593.241133] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 593.241146] CR2: 00007f57d3c66a84 CR3: 000000026e2b6000 CR4: 00000000001406e0 [ 593.241158] Call Trace: [ 593.241233] i915_schedule+0x1f/0x30 [i915] [ 593.241326] i915_request_add+0x1a9/0x290 [i915] [ 593.241393] i915_gem_do_execbuffer+0x45f/0x1150 [i915] [ 593.241411] ? init_object+0x49/0x80 [ 593.241425] ? ___slab_alloc.constprop.91+0x4b8/0x4e0 [ 593.241491] ? i915_gem_execbuffer2_ioctl+0x99/0x380 [i915] [ 593.241563] ? i915_gem_execbuffer_ioctl+0x270/0x270 [i915] [ 593.241629] i915_gem_execbuffer2_ioctl+0x1bb/0x380 [i915] [ 593.241705] ? i915_gem_execbuffer_ioctl+0x270/0x270 [i915] [ 593.241724] drm_ioctl_kernel+0x81/0xd0 [ 593.241738] drm_ioctl+0x1a7/0x310 [ 593.241803] ? i915_gem_execbuffer_ioctl+0x270/0x270 [i915] [ 593.241819] ? __update_load_avg_se+0x1c9/0x240 [ 593.241834] ? pick_next_entity+0x7e/0x120 [ 593.241851] do_vfs_ioctl+0x88/0x5d0 [ 593.241880] ksys_ioctl+0x35/0x70 [ 593.241894] __x64_sys_ioctl+0x11/0x20 [ 593.241907] do_syscall_64+0x44/0xf0 [ 593.241924] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 593.241940] RIP: 0033:0x7f4615ffe757 [ 593.241952] Code: 00 00 90 48 8b 05 39 a7 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 09 a7 0c 00 f7 d8 64 89 01 48 [ 593.241970] RSP: 002b:00007ffc1030ddf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ 593.241984] RAX: ffffffffffffffda RBX: 00007ffc10324420 RCX: 00007f4615ffe757 [ 593.241997] RDX: 00007ffc1030e220 RSI: 0000000040406469 RDI: 0000000000000003 [ 593.242010] RBP: 00007ffc1030e220 R08: 00007f46160c9208 R09: 00007f46160c9240 [ 593.242022] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000040406469 [ 593.242038] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000 [ 593.242058] Modules linked in: i915 intel_gtt drm_kms_helper prime_numbers v2: Track the local engine cache and explicitly clear it when switching engine locks. Fixes: a02eb97 ("drm/i915/execlists: Cache the priolist when rescheduling") Testcase: igt/gem_exec_whisper/contexts-priority # rare! Signed-off-by: Chris Wilson <[email protected]> Cc: Joonas Lahtinen <[email protected]> Cc: Tvrtko Ursulin <[email protected]> Cc: Michał Winiarski <[email protected]> Reviewed-by: Tvrtko Ursulin <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

If we drop the engine lock, we may run execlists_dequeue which may free the priolist. Therefore if we ever drop the execution lock on the engine, we have to discard our cache and refetch the priolist to ensure we do not use a stale pointer. [ 506.418935] [IGT] gem_exec_whisper: starting subtest contexts-priority [ 593.240825] general protection fault: 0000 [#1] SMP [ 593.240863] CPU: 1 PID: 494 Comm: gem_exec_whispe Tainted: G U 5.0.0-rc6+ torvalds#100 [ 593.240879] Hardware name: /NUC6CAYB, BIOS AYAPLCEL.86A.0029.2016.1124.1625 11/24/2016 [ 593.240965] RIP: 0010:__i915_schedule+0x1fe/0x320 [i915] [ 593.240981] Code: 48 8b 0c 24 48 89 c3 49 8b 45 28 49 8b 75 20 4c 89 3c 24 48 89 46 08 48 89 30 48 8b 43 08 48 89 4b 08 49 89 5d 20 49 89 45 28 <48> 89 08 45 39 a7 b8 03 00 00 7d 44 45 89 a7 b8 03 00 00 49 8b 85 [ 593.240999] RSP: 0018:ffffc90000057a60 EFLAGS: 00010046 [ 593.241013] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8882582d7870 RCX: ffff88826baba6f0 [ 593.241026] RDX: 0000000000000000 RSI: ffff8882582d6e70 RDI: ffff888273482194 [ 593.241049] RBP: ffffc90000057a68 R08: ffff8882582d7680 R09: ffff8882582d7840 [ 593.241068] R10: 0000000000000000 R11: ffffea00095ebe08 R12: 0000000000000728 [ 593.241105] R13: ffff88826baba6d0 R14: ffffc90000057a40 R15: ffff888273482158 [ 593.241120] FS: 00007f4613fb3900(0000) GS:ffff888277a80000(0000) knlGS:0000000000000000 [ 593.241133] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 593.241146] CR2: 00007f57d3c66a84 CR3: 000000026e2b6000 CR4: 00000000001406e0 [ 593.241158] Call Trace: [ 593.241233] i915_schedule+0x1f/0x30 [i915] [ 593.241326] i915_request_add+0x1a9/0x290 [i915] [ 593.241393] i915_gem_do_execbuffer+0x45f/0x1150 [i915] [ 593.241411] ? init_object+0x49/0x80 [ 593.241425] ? ___slab_alloc.constprop.91+0x4b8/0x4e0 [ 593.241491] ? i915_gem_execbuffer2_ioctl+0x99/0x380 [i915] [ 593.241563] ? i915_gem_execbuffer_ioctl+0x270/0x270 [i915] [ 593.241629] i915_gem_execbuffer2_ioctl+0x1bb/0x380 [i915] [ 593.241705] ? i915_gem_execbuffer_ioctl+0x270/0x270 [i915] [ 593.241724] drm_ioctl_kernel+0x81/0xd0 [ 593.241738] drm_ioctl+0x1a7/0x310 [ 593.241803] ? i915_gem_execbuffer_ioctl+0x270/0x270 [i915] [ 593.241819] ? __update_load_avg_se+0x1c9/0x240 [ 593.241834] ? pick_next_entity+0x7e/0x120 [ 593.241851] do_vfs_ioctl+0x88/0x5d0 [ 593.241880] ksys_ioctl+0x35/0x70 [ 593.241894] __x64_sys_ioctl+0x11/0x20 [ 593.241907] do_syscall_64+0x44/0xf0 [ 593.241924] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 593.241940] RIP: 0033:0x7f4615ffe757 [ 593.241952] Code: 00 00 90 48 8b 05 39 a7 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 09 a7 0c 00 f7 d8 64 89 01 48 [ 593.241970] RSP: 002b:00007ffc1030ddf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ 593.241984] RAX: ffffffffffffffda RBX: 00007ffc10324420 RCX: 00007f4615ffe757 [ 593.241997] RDX: 00007ffc1030e220 RSI: 0000000040406469 RDI: 0000000000000003 [ 593.242010] RBP: 00007ffc1030e220 R08: 00007f46160c9208 R09: 00007f46160c9240 [ 593.242022] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000040406469 [ 593.242038] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000 [ 593.242058] Modules linked in: i915 intel_gtt drm_kms_helper prime_numbers v2: Track the local engine cache and explicitly clear it when switching engine locks. Fixes: a02eb97 ("drm/i915/execlists: Cache the priolist when rescheduling") Testcase: igt/gem_exec_whisper/contexts-priority # rare! Signed-off-by: Chris Wilson <[email protected]> Cc: Joonas Lahtinen <[email protected]> Cc: Tvrtko Ursulin <[email protected]> Cc: Michał Winiarski <[email protected]> Reviewed-by: Tvrtko Ursulin <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] (cherry picked from commit ed7dc67) Signed-off-by: Rodrigo Vivi <[email protected]>

With PREEMPT_COUNT=y, when a CPU is offlined and then onlined again, we get: BUG: scheduling while atomic: swapper/1/0/0x00000000 no locks held by swapper/1/0. CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.15.0-rc2+ torvalds#100 Call Trace: dump_stack_lvl+0xac/0x108 __schedule_bug+0xac/0xe0 __schedule+0xcf8/0x10d0 schedule_idle+0x3c/0x70 do_idle+0x2d8/0x4a0 cpu_startup_entry+0x38/0x40 start_secondary+0x2ec/0x3a0 start_secondary_prolog+0x10/0x14 This is because powerpc's arch_cpu_idle_dead() decrements the idle task's preempt count, for reasons explained in commit a7c2bb8 ("powerpc: Re-enable preemption before cpu_die()"), specifically "start_secondary() expects a preempt_count() of 0." However, since commit 2c669ef ("powerpc/preempt: Don't touch the idle task's preempt_count during hotplug") and commit f1a0a37 ("sched/core: Initialize the idle task with preemption disabled"), that justification no longer holds. The idle task isn't supposed to re-enable preemption, so remove the vestigial preempt_enable() from the CPU offline path. Tested with pseries and powernv in qemu, and pseries on PowerVM. Fixes: 2c669ef ("powerpc/preempt: Don't touch the idle task's preempt_count during hotplug") Fixes: f1a0a37 ("sched/core: Initialize the idle task with preemption disabled") Signed-off-by: Nathan Lynch <[email protected]>

With PREEMPT_COUNT=y, when a CPU is offlined and then onlined again, we get: BUG: scheduling while atomic: swapper/1/0/0x00000000 no locks held by swapper/1/0. CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.15.0-rc2+ torvalds#100 Call Trace: dump_stack_lvl+0xac/0x108 __schedule_bug+0xac/0xe0 __schedule+0xcf8/0x10d0 schedule_idle+0x3c/0x70 do_idle+0x2d8/0x4a0 cpu_startup_entry+0x38/0x40 start_secondary+0x2ec/0x3a0 start_secondary_prolog+0x10/0x14 This is because powerpc's arch_cpu_idle_dead() decrements the idle task's preempt count, for reasons explained in commit a7c2bb8 ("powerpc: Re-enable preemption before cpu_die()"), specifically "start_secondary() expects a preempt_count() of 0." However, since commit 2c669ef ("powerpc/preempt: Don't touch the idle task's preempt_count during hotplug") and commit f1a0a37 ("sched/core: Initialize the idle task with preemption disabled"), that justification no longer holds. The idle task isn't supposed to re-enable preemption, so remove the vestigial preempt_enable() from the CPU offline path. Tested with pseries and powernv in qemu, and pseries on PowerVM. Fixes: 2c669ef ("powerpc/preempt: Don't touch the idle task's preempt_count during hotplug") Signed-off-by: Nathan Lynch <[email protected]> Reviewed-by: Valentin Schneider <[email protected]>

With PREEMPT_COUNT=y, when a CPU is offlined and then onlined again, we get: BUG: scheduling while atomic: swapper/1/0/0x00000000 no locks held by swapper/1/0. CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.15.0-rc2+ torvalds#100 Call Trace: dump_stack_lvl+0xac/0x108 __schedule_bug+0xac/0xe0 __schedule+0xcf8/0x10d0 schedule_idle+0x3c/0x70 do_idle+0x2d8/0x4a0 cpu_startup_entry+0x38/0x40 start_secondary+0x2ec/0x3a0 start_secondary_prolog+0x10/0x14 This is because powerpc's arch_cpu_idle_dead() decrements the idle task's preempt count, for reasons explained in commit a7c2bb8 ("powerpc: Re-enable preemption before cpu_die()"), specifically "start_secondary() expects a preempt_count() of 0." However, since commit 2c669ef ("powerpc/preempt: Don't touch the idle task's preempt_count during hotplug") and commit f1a0a37 ("sched/core: Initialize the idle task with preemption disabled"), that justification no longer holds. The idle task isn't supposed to re-enable preemption, so remove the vestigial preempt_enable() from the CPU offline path. Tested with pseries and powernv in qemu, and pseries on PowerVM. Fixes: 2c669ef ("powerpc/preempt: Don't touch the idle task's preempt_count during hotplug") Signed-off-by: Nathan Lynch <[email protected]> Reviewed-by: Valentin Schneider <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected]

With PREEMPT_COUNT=y, when a CPU is offlined and then onlined again, we get: BUG: scheduling while atomic: swapper/1/0/0x00000000 no locks held by swapper/1/0. CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.15.0-rc2+ torvalds#100 Call Trace: dump_stack_lvl+0xac/0x108 __schedule_bug+0xac/0xe0 __schedule+0xcf8/0x10d0 schedule_idle+0x3c/0x70 do_idle+0x2d8/0x4a0 cpu_startup_entry+0x38/0x40 start_secondary+0x2ec/0x3a0 start_secondary_prolog+0x10/0x14 This is because powerpc's arch_cpu_idle_dead() decrements the idle task's preempt count, for reasons explained in commit a7c2bb8 ("powerpc: Re-enable preemption before cpu_die()"), specifically "start_secondary() expects a preempt_count() of 0." However, since commit 2c669ef ("powerpc/preempt: Don't touch the idle task's preempt_count during hotplug") and commit f1a0a37 ("sched/core: Initialize the idle task with preemption disabled"), that justification no longer holds. The idle task isn't supposed to re-enable preemption, so remove the vestigial preempt_enable() from the CPU offline path. Tested with pseries and powernv in qemu, and pseries on PowerVM. Fixes: 2c669ef ("powerpc/preempt: Don't touch the idle task's preempt_count during hotplug") Signed-off-by: Nathan Lynch <[email protected]> Reviewed-by: Valentin Schneider <[email protected]> Reviewed-by: Srikar Dronamraju <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected]

Race condition of page table update can happen in kernel_init as both of memory hotplug module init and the following mark_rodata_ro can update page table. The function excute flow chart is: ------------------------- kernel_init kernel_init_freeable ... do_initcall ... module_init [A] ... mark_readonly mark_rodata_ro [B] ------------------------- [A] can contains memory hotplug init therefore both [A] and [B] can update page table at the same time that may lead to race. Here we introduce memory hotplug lock to guard mark_rodata_ro to avoid the race condition. I catch the related error when test virtio-mem (a new memory hotplug driver) on arm64 and may be a potential bug for other arches. How to reproduce on arm64: (1) prepare a kernel with virtio-mem enabled on arm64 (2) start a VM using Cloud Hypervisor[1] using the kernel above (3) hotplug memory, 20G in my case, with virtio-mem (4) reboot or load new kernel using kexec Test for server times, you may find the error below: [ 1.131039] Unable to handle kernel paging request at virtual address fffffbfffda3b140 [ 1.134504] Mem abort info: [ 1.135722] ESR = 0x96000007 [ 1.136991] EC = 0x25: DABT (current EL), IL = 32 bits [ 1.139189] SET = 0, FnV = 0 [ 1.140467] EA = 0, S1PTW = 0 [ 1.141755] FSC = 0x07: level 3 translation fault [ 1.143787] Data abort info: [ 1.144976] ISV = 0, ISS = 0x00000007 [ 1.146554] CM = 0, WnR = 0 [ 1.147817] swapper pgtable: 4k pages, 48-bit VAs, pgdp=00000000426f2000 [ 1.150551] [fffffbfffda3b140] pgd=0000000042ffd003, p4d=0000000042ffd003, pud=0000000042ffe003, pmd=0000000042fff003, pte=0000000000000000 [ 1.155728] Internal error: Oops: 96000007 [#1] SMP [ 1.157724] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G C 5.15.0-rc3+ torvalds#100 [ 1.161002] Hardware name: linux,dummy-virt (DT) [ 1.162939] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 1.165825] pc : alloc_init_pud+0x38c/0x550 [ 1.167610] lr : alloc_init_pud+0x394/0x550 [ 1.169358] sp : ffff80001001bd10 ...... [ 1.200527] Call trace: [ 1.201583] alloc_init_pud+0x38c/0x550 [ 1.203218] __create_pgd_mapping+0x94/0xe0 [ 1.204983] update_mapping_prot+0x50/0xd8 [ 1.206730] mark_rodata_ro+0x50/0x58 [ 1.208281] kernel_init+0x3c/0x120 [ 1.209760] ret_from_fork+0x10/0x20 [ 1.211298] Code: eb15003f 54000061 d5033a9f d5033fdf (f94000a1) [ 1.213856] ---[ end trace 59473413ffe3f52d ]--- [ 1.215850] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b [1] https://github.com/cloud-hypervisor/cloud-hypervisor Suggested-by: Anshuman Khandual <[email protected]> Signed-off-by: Jianyong Wu <[email protected]>

Race condition of page table update can happen in kernel boot period as both of memory hotplug in kernel init and the following mark_rodata_ro can update page table. For virtio-mem, the function excute flow chart is: ------------------------- kernel_init kernel_init_freeable ... do_initcall ... module_init [A] ... mark_readonly mark_rodata_ro [B] ------------------------- virtio-mem can be initialized at [A] and spwan a workqueue to add memory, therefore the race of update page table can happen inside [B]. What's more, the race condition can happen even for ACPI based memory hotplug, as it can burst into kernel boot period while page table is updating inside mark_rodata_ro. That's why memory hotplug lock is needed to guard mark_rodata_ro to avoid the race condition. It may only happen in arm64. As fixmap, which is the global resource, is used in page table creating. So, the change is only for arm64. The error often occurs inside alloc_init_pud() in arch/arm64/mm/mmu.c the race condition flow is: *************** begin ************ kerenl_init virtio-mem workqueue ========= ======== alloc_init_pud(...) pudp = pud_set_fixmap_offset(..) alloc_init_pud(...) ... ... READ_ONCE(*pudp) //OK! pudp = pud_set_fixmap_offset( ... ... pud_clear_fixmap() //fixmap break READ_ONCE(*pudp) //CRASH! **************** end ************* I catch the related error when test virtio-mem (a new memory hotplug driver) on arm64. How to reproduce: (1) prepare a kernel with virtio-mem enabled on arm64 (2) start a VM using Cloud Hypervisor using the kernel above (3) hotplug memory, 20G in my case, with virtio-mem (4) reboot or start a new kernel using kexec Test for server times, you may find the error below: [ 1.131039] Unable to handle kernel paging request at virtual address fffffbfffda3b140 [ 1.134504] Mem abort info: [ 1.135722] ESR = 0x96000007 [ 1.136991] EC = 0x25: DABT (current EL), IL = 32 bits [ 1.139189] SET = 0, FnV = 0 [ 1.140467] EA = 0, S1PTW = 0 [ 1.141755] FSC = 0x07: level 3 translation fault [ 1.143787] Data abort info: [ 1.144976] ISV = 0, ISS = 0x00000007 [ 1.146554] CM = 0, WnR = 0 [ 1.147817] swapper pgtable: 4k pages, 48-bit VAs, pgdp=00000000426f2000 [ 1.150551] [fffffbfffda3b140] pgd=0000000042ffd003, p4d=0000000042ffd003, pud=0000000042ffe003, pmd=0000000042fff003, pte=0000000000000000 [ 1.155728] Internal error: Oops: 96000007 [cloud-hypervisor#1] SMP [ 1.157724] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G C 5.15.0-rc3+ torvalds#100 [ 1.161002] Hardware name: linux,dummy-virt (DT) [ 1.162939] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 1.165825] pc : alloc_init_pud+0x38c/0x550 [ 1.167610] lr : alloc_init_pud+0x394/0x550 [ 1.169358] sp : ffff80001001bd10 ...... [ 1.200527] Call trace: [ 1.201583] alloc_init_pud+0x38c/0x550 [ 1.203218] __create_pgd_mapping+0x94/0xe0 [ 1.204983] update_mapping_prot+0x50/0xd8 [ 1.206730] mark_rodata_ro+0x50/0x58 [ 1.208281] kernel_init+0x3c/0x120 [ 1.209760] ret_from_fork+0x10/0x20 [ 1.211298] Code: eb15003f 54000061 d5033a9f d5033fdf (f94000a1) [ 1.213856] ---[ end trace 59473413ffe3f52d ]--- [ 1.215850] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b We can see that the error derived from the l3 translation as the pte value is *0*. That is because the fixmap has been clear when access. Signed-off-by: Jianyong Wu <[email protected]>

Race condition of page table update can happen in kernel boot period as both of memory hotplug action when kernel init and the mark_rodata_ro can update page table. For virtio-mem, the function excute flow chart is: ------------------------- kernel_init kernel_init_freeable ... do_initcall ... module_init [A] ... mark_readonly mark_rodata_ro [B] ------------------------- virtio-mem can be initialized at [A] and spwan a workqueue to add memory, therefore the race of update page table can happen inside [B]. What's more, the race condition can happen even for ACPI based memory hotplug, as it can burst into kernel boot period while page table is updating inside mark_rodata_ro. That's why memory hotplug lock is needed to guard mark_rodata_ro to avoid the race condition. It may only happen in arm64. As fixmap, which is the global resource, is used in page table creating. So, the change is only for arm64. The error often occurs inside alloc_init_pud() in arch/arm64/mm/mmu.c the race condition flow is: *************** begin ************ kerenl_init virtio-mem workqueue ========= ======== alloc_init_pud(...) pudp = pud_set_fixmap_offset(..) alloc_init_pud(...) ... ... READ_ONCE(*pudp) //OK! pudp = pud_set_fixmap_offset( ... ... pud_clear_fixmap() //fixmap break READ_ONCE(*pudp) //CRASH! **************** end ************* I catch the related error when test virtio-mem (a new memory hotplug driver) on arm64. How to reproduce: (1) prepare a kernel with virtio-mem enabled on arm64 (2) start a VM using Cloud Hypervisor using the kernel above (3) hotplug memory, 20G in my case, with virtio-mem (4) reboot or start a new kernel using kexec Test for server times, you may find the error below: [ 1.131039] Unable to handle kernel paging request at virtual address fffffbfffda3b140 [ 1.134504] Mem abort info: [ 1.135722] ESR = 0x96000007 [ 1.136991] EC = 0x25: DABT (current EL), IL = 32 bits [ 1.139189] SET = 0, FnV = 0 [ 1.140467] EA = 0, S1PTW = 0 [ 1.141755] FSC = 0x07: level 3 translation fault [ 1.143787] Data abort info: [ 1.144976] ISV = 0, ISS = 0x00000007 [ 1.146554] CM = 0, WnR = 0 [ 1.147817] swapper pgtable: 4k pages, 48-bit VAs, pgdp=00000000426f2000 [ 1.150551] [fffffbfffda3b140] pgd=0000000042ffd003, p4d=0000000042ffd003, pud=0000000042ffe003, pmd=0000000042fff003, pte=0000000000000000 [ 1.155728] Internal error: Oops: 96000007 [#1] SMP [ 1.157724] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G C 5.15.0-rc3+ torvalds#100 [ 1.161002] Hardware name: linux,dummy-virt (DT) [ 1.162939] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 1.165825] pc : alloc_init_pud+0x38c/0x550 [ 1.167610] lr : alloc_init_pud+0x394/0x550 [ 1.169358] sp : ffff80001001bd10 ...... [ 1.200527] Call trace: [ 1.201583] alloc_init_pud+0x38c/0x550 [ 1.203218] __create_pgd_mapping+0x94/0xe0 [ 1.204983] update_mapping_prot+0x50/0xd8 [ 1.206730] mark_rodata_ro+0x50/0x58 [ 1.208281] kernel_init+0x3c/0x120 [ 1.209760] ret_from_fork+0x10/0x20 [ 1.211298] Code: eb15003f 54000061 d5033a9f d5033fdf (f94000a1) [ 1.213856] ---[ end trace 59473413ffe3f52d ]--- [ 1.215850] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b We can see that the error derived from the l3 translation as the pte value is *0*. That is because the fixmap has been clear when access. Signed-off-by: Jianyong Wu <[email protected]>

[ Upstream commit 787252a ] With PREEMPT_COUNT=y, when a CPU is offlined and then onlined again, we get: BUG: scheduling while atomic: swapper/1/0/0x00000000 no locks held by swapper/1/0. CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.15.0-rc2+ torvalds#100 Call Trace: dump_stack_lvl+0xac/0x108 __schedule_bug+0xac/0xe0 __schedule+0xcf8/0x10d0 schedule_idle+0x3c/0x70 do_idle+0x2d8/0x4a0 cpu_startup_entry+0x38/0x40 start_secondary+0x2ec/0x3a0 start_secondary_prolog+0x10/0x14 This is because powerpc's arch_cpu_idle_dead() decrements the idle task's preempt count, for reasons explained in commit a7c2bb8 ("powerpc: Re-enable preemption before cpu_die()"), specifically "start_secondary() expects a preempt_count() of 0." However, since commit 2c669ef ("powerpc/preempt: Don't touch the idle task's preempt_count during hotplug") and commit f1a0a37 ("sched/core: Initialize the idle task with preemption disabled"), that justification no longer holds. The idle task isn't supposed to re-enable preemption, so remove the vestigial preempt_enable() from the CPU offline path. Tested with pseries and powernv in qemu, and pseries on PowerVM. Fixes: 2c669ef ("powerpc/preempt: Don't touch the idle task's preempt_count during hotplug") Signed-off-by: Nathan Lynch <[email protected]> Reviewed-by: Valentin Schneider <[email protected]> Reviewed-by: Srikar Dronamraju <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sasha Levin <[email protected]>

Fix the following checkpatch complaints: ERROR: code indent should use tabs where possible torvalds#96: FILE: lib/livepatch/test_klp_convert1.c:43: + return 0;$ WARNING: please, no spaces at the start of a line torvalds#96: FILE: lib/livepatch/test_klp_convert1.c:43: + return 0;$ ERROR: code indent should use tabs where possible torvalds#99: FILE: lib/livepatch/test_klp_convert1.c:46: + .set = print_debug_set,$ WARNING: please, no spaces at the start of a line torvalds#99: FILE: lib/livepatch/test_klp_convert1.c:46: + .set = print_debug_set,$ ERROR: code indent should use tabs where possible torvalds#100: FILE: lib/livepatch/test_klp_convert1.c:47: + .get = param_get_int,$ WARNING: please, no spaces at the start of a line torvalds#100: FILE: lib/livepatch/test_klp_convert1.c:47: + .get = param_get_int,$ ERROR: code indent should use tabs where possible torvalds#221: FILE: lib/livepatch/test_klp_convert2.c:43: + return 0;$ WARNING: please, no spaces at the start of a line torvalds#221: FILE: lib/livepatch/test_klp_convert2.c:43: + return 0;$ ERROR: code indent should use tabs where possible torvalds#224: FILE: lib/livepatch/test_klp_convert2.c:46: + .set = print_debug_set,$ WARNING: please, no spaces at the start of a line torvalds#224: FILE: lib/livepatch/test_klp_convert2.c:46: + .set = print_debug_set,$ ERROR: code indent should use tabs where possible torvalds#225: FILE: lib/livepatch/test_klp_convert2.c:47: + .get = param_get_int,$ WARNING: please, no spaces at the start of a line torvalds#225: FILE: lib/livepatch/test_klp_convert2.c:47: + .get = param_get_int,$ Signed-off-by: Joe Lawrence <[email protected]>

[ Upstream commit 787252a ] With PREEMPT_COUNT=y, when a CPU is offlined and then onlined again, we get: BUG: scheduling while atomic: swapper/1/0/0x00000000 no locks held by swapper/1/0. CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.15.0-rc2+ torvalds#100 Call Trace: dump_stack_lvl+0xac/0x108 __schedule_bug+0xac/0xe0 __schedule+0xcf8/0x10d0 schedule_idle+0x3c/0x70 do_idle+0x2d8/0x4a0 cpu_startup_entry+0x38/0x40 start_secondary+0x2ec/0x3a0 start_secondary_prolog+0x10/0x14 This is because powerpc's arch_cpu_idle_dead() decrements the idle task's preempt count, for reasons explained in commit a7c2bb8 ("powerpc: Re-enable preemption before cpu_die()"), specifically "start_secondary() expects a preempt_count() of 0." However, since commit 2c669ef ("powerpc/preempt: Don't touch the idle task's preempt_count during hotplug") and commit f1a0a37 ("sched/core: Initialize the idle task with preemption disabled"), that justification no longer holds. The idle task isn't supposed to re-enable preemption, so remove the vestigial preempt_enable() from the CPU offline path. Tested with pseries and powernv in qemu, and pseries on PowerVM. Fixes: 2c669ef ("powerpc/preempt: Don't touch the idle task's preempt_count during hotplug") Signed-off-by: Nathan Lynch <[email protected]> Reviewed-by: Valentin Schneider <[email protected]> Reviewed-by: Srikar Dronamraju <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sasha Levin <[email protected]>

Fix the following checkpatch complaints: ERROR: code indent should use tabs where possible torvalds#96: FILE: lib/livepatch/test_klp_convert1.c:43: + return 0;$ WARNING: please, no spaces at the start of a line torvalds#96: FILE: lib/livepatch/test_klp_convert1.c:43: + return 0;$ ERROR: code indent should use tabs where possible torvalds#99: FILE: lib/livepatch/test_klp_convert1.c:46: + .set = print_debug_set,$ WARNING: please, no spaces at the start of a line torvalds#99: FILE: lib/livepatch/test_klp_convert1.c:46: + .set = print_debug_set,$ ERROR: code indent should use tabs where possible torvalds#100: FILE: lib/livepatch/test_klp_convert1.c:47: + .get = param_get_int,$ WARNING: please, no spaces at the start of a line torvalds#100: FILE: lib/livepatch/test_klp_convert1.c:47: + .get = param_get_int,$ ERROR: code indent should use tabs where possible torvalds#221: FILE: lib/livepatch/test_klp_convert2.c:43: + return 0;$ WARNING: please, no spaces at the start of a line torvalds#221: FILE: lib/livepatch/test_klp_convert2.c:43: + return 0;$ ERROR: code indent should use tabs where possible torvalds#224: FILE: lib/livepatch/test_klp_convert2.c:46: + .set = print_debug_set,$ WARNING: please, no spaces at the start of a line torvalds#224: FILE: lib/livepatch/test_klp_convert2.c:46: + .set = print_debug_set,$ ERROR: code indent should use tabs where possible torvalds#225: FILE: lib/livepatch/test_klp_convert2.c:47: + .get = param_get_int,$ WARNING: please, no spaces at the start of a line torvalds#225: FILE: lib/livepatch/test_klp_convert2.c:47: + .get = param_get_int,$ Signed-off-by: Joe Lawrence <[email protected]>

Fix crash consistency issue with alternate logs (torvalds#100)

Some use-cases and/or data patterns may benefit from larger zspages. Currently the limit on the number of physical pages that are linked into a zspage is hardcoded to 4. Higher limit changes key characteristics of a number of the size clases, improving compactness of the pool and redusing the amount of memory zsmalloc pool uses. For instance, the huge size class watermark is currently set to 3264 bytes. With order 3 zspages we have more normal classe and huge size watermark becomes 3632. With order 4 zspages huge size watermark becomes 3840. Commit #1 has more numbers and some analysis. This patch (of 6): zsmalloc has 255 size classes. Size classes contain a number of zspages, which store objects of the same size. zspage can consist of up to four physical pages. The exact (most optimal) zspage size is calculated for each size class during zsmalloc pool creation. As a reasonable optimization, zsmalloc merges size classes that have similar characteristics: number of pages per zspage and number of objects zspage can store. For example, let's look at the following size classes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable .. 94 1536 0 0 0 0 0 3 0 100 1632 0 0 0 0 0 2 0 .. Size classes torvalds#95-99 are merged with size class torvalds#100. That is, each time we store an object of size, say, 1568 bytes instead of using class torvalds#96 we end up storing it in size class torvalds#100. Class torvalds#100 is for objects of 1632 bytes in size, hence every 1568 bytes object wastes 1632-1568 bytes. Class torvalds#100 zspages consist of 2 physical pages and can hold 5 objects. When we need to store, say, 13 objects of size 1568 we end up allocating three zspages; in other words, 6 physical pages. However, if we'll look closer at size class torvalds#96 (which should hold objects of size 1568 bytes) and trace get_pages_per_zspage(): pages per zspage wasted bytes used% 1 960 76 2 352 95 3 1312 89 4 704 95 5 96 99 We'd notice that the most optimal zspage configuration for this class is when it consists of 5 physical pages, but currently we never let zspages to consists of more than 4 pages. A 5 page class torvalds#96 configuration would store 13 objects of size 1568 in a single zspage, allocating 5 physical pages, as opposed to 6 physical pages that class torvalds#100 will allocate. A higher order zspage for class torvalds#96 also changes its key characteristics: pages per-zspage and objects per-zspage. As a result classes torvalds#96 and torvalds#100 are not merged anymore, which gives us more compact zsmalloc. Let's take a closer look at the bottom of /sys/kernel/debug/zsmalloc/zram0/classes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 202 3264 0 0 0 0 0 4 0 254 4096 0 0 0 0 0 1 0 ... For exactly same reason - maximum 4 pages per zspage - the last non-huge size class is torvalds#202, which stores objects of size 3264 bytes. Any object larger than 3264 bytes, hence, is considered to be huge and lands in size class torvalds#254, which uses a whole physical page to store every object. To put it slightly differently - objects in huge classes don't share physical pages. 3264 bytes is too low of a watermark and we have too many huge classes: classes from torvalds#203 to torvalds#254. Similarly to class size torvalds#96 above, higher order zspages change key characteristics for some of those huge size classes and thus those classes become normal classes, where stored objects share physical pages. We move huge class watermark with higher order zspages. For order 3, huge class watermark becomes 3632 bytes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 202 3264 0 0 0 0 0 4 0 211 3408 0 0 0 0 0 5 0 217 3504 0 0 0 0 0 6 0 222 3584 0 0 0 0 0 7 0 225 3632 0 0 0 0 0 8 0 254 4096 0 0 0 0 0 1 0 ... For order 4, huge class watermark becomes 3840 bytes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 202 3264 0 0 0 0 0 4 0 206 3328 0 0 0 0 0 13 0 207 3344 0 0 0 0 0 9 0 208 3360 0 0 0 0 0 14 0 211 3408 0 0 0 0 0 5 0 212 3424 0 0 0 0 0 16 0 214 3456 0 0 0 0 0 11 0 217 3504 0 0 0 0 0 6 0 219 3536 0 0 0 0 0 13 0 222 3584 0 0 0 0 0 7 0 223 3600 0 0 0 0 0 15 0 225 3632 0 0 0 0 0 8 0 228 3680 0 0 0 0 0 9 0 230 3712 0 0 0 0 0 10 0 232 3744 0 0 0 0 0 11 0 234 3776 0 0 0 0 0 12 0 235 3792 0 0 0 0 0 13 0 236 3808 0 0 0 0 0 14 0 238 3840 0 0 0 0 0 15 0 254 4096 0 0 0 0 0 1 0 ... TESTS ===== 1) ChromeOS memory pressure test ----------------------------------------------------------------------------- Our standard memory pressure test, that is designed with the reproducibility in mind. zram is configured as a swap device, lzo-rle compression algorithm. We captured /sys/block/zram0/mm_stat after every test and rebooted device. Columns per (Documentation/admin-guide/blockdev/zram.rst) orig_data_size mem_used_total mem_used_max pages_compacted compr_data_size mem_limit same_pages huge_pages ORDER 2 (BASE) 10353639424 2981711944 3166896128 0 3543158784 579494 825135 123707 10168573952 2932288347 3106541568 0 3499085824 565187 853137 126153 9950461952 2815911234 3035693056 0 3441090560 586696 748054 122103 9892335616 2779566152 2943459328 0 3514736640 591541 650696 119621 9993949184 2814279212 3021357056 0 3336421376 582488 711744 121273 9953226752 2856382009 3025649664 0 3512893440 564559 787861 123034 9838448640 2785481728 2997575680 0 3367219200 573282 777099 122739 ORDER 3 9509138432 2706941227 2823393280 0 3389587456 535856 1011472 90223 10105245696 2882368370 3013095424 0 3296165888 563896 1059033 94808 9531236352 2666125512 2867650560 0 3396173824 567117 1126396 88807 9561812992 2714536764 2956652544 0 3310505984 548223 827322 90992 9807470592 2790315707 2908053504 0 3378315264 563670 1020933 93725 10178371584 2948838782 3071209472 0 3329548288 548533 954546 90730 9925165056 2849839413 2958274560 0 3336978432 551464 1058302 89381 ORDER 4 9444515840 2613362645 2668232704 0 3396759552 573735 1162207 83475 10129108992 2925888488 3038351360 0 3499597824 555634 1231542 84525 9876594688 2786692282 2897006592 0 3469463552 584835 1290535 84133 10012909568 2649711847 2801512448 0 3171323904 675405 750728 80424 10120966144 2866742402 2978639872 0 3257815040 587435 1093981 83587 9578790912 2671245225 2802270208 0 3376353280 545548 1047930 80895 10108588032 2888433523 2983960576 0 3316641792 571445 1290640 81402 First, we establish that order 3 and 4 don't cause any statistically significant change in `orig_data_size` (number of bytes we store during the test), in other words larger zspages don't cause regressions. T-test for order 3: x order-2-stored + order-3-stored +-----------------------------------------------------------------------------+ |+ + + + x x + x x + x+ x| | |________________________AM__|_________M_____A____|__________| | +-----------------------------------------------------------------------------+ N Min Max Median Avg Stddev x 7 9.8384486e+09 1.0353639e+10 9.9532268e+09 1.0021519e+10 1.7916718e+08 + 7 9.5091384e+09 1.0178372e+10 9.8074706e+09 9.8026344e+09 2.7856206e+08 No difference proven at 95.0% confidence T-test for order 4: x order-2-stored + order-4-stored +-----------------------------------------------------------------------------+ | + | |+ + x +x xx x + ++ x x| | |__________________|____A____M____M____________|_| | +-----------------------------------------------------------------------------+ N Min Max Median Avg Stddev x 7 9.8384486e+09 1.0353639e+10 9.9532268e+09 1.0021519e+10 1.7916718e+08 + 7 9.4445158e+09 1.0129109e+10 1.001291e+10 9.8959249e+09 2.7947784e+08 No difference proven at 95.0% confidence Next we establish that there is a statistically significant improvement in `mem_used_total` metrics. T-test for order 3: x order-2-usedmem + order-3-usedmem +-----------------------------------------------------------------------------+ |+ + + x ++ x + xx x + x x| | |_________________A__M__|____________|__A________________| | +-----------------------------------------------------------------------------+ N Min Max Median Avg Stddev x 7 2.9434593e+09 3.1668961e+09 3.0256497e+09 3.0424532e+09 73235062 + 7 2.8233933e+09 3.0712095e+09 2.9566525e+09 2.9426185e+09 84630851 Difference at 95.0% confidence -9.98347e+07 +/- 9.21744e+07 -3.28139% +/- 3.02961% (Student's t, pooled s = 7.91383e+07) T-test for order 4: x order-2-usedmem + order-4-usedmem +-----------------------------------------------------------------------------+ | + x | |+ + + x ++ x x * x x| | |__________________A__M__________|_____|_M__A__________| | +-----------------------------------------------------------------------------+ N Min Max Median Avg Stddev x 7 2.9434593e+09 3.1668961e+09 3.0256497e+09 3.0424532e+09 73235062 + 7 2.6682327e+09 3.0383514e+09 2.8970066e+09 2.8814248e+09 1.3098053e+08 Difference at 95.0% confidence -1.61028e+08 +/- 1.23591e+08 -5.29272% +/- 4.0622% (Student's t, pooled s = 1.06111e+08) Order 3 zspages also show statistically significant improvement in `mem_used_max` metrics. T-test for order 3: x order-2-maxmem + order-3-maxmem +-----------------------------------------------------------------------------+ |+ + + x+ x + + + x x x x| | |________M__A_________|_|_____________________A___________M____________| | +-----------------------------------------------------------------------------+ N Min Max Median Avg Stddev x 7 3.3364214e+09 3.5431588e+09 3.4990858e+09 3.4592294e+09 80073158 + 7 3.2961659e+09 3.3961738e+09 3.3369784e+09 3.3481822e+09 39840377 Difference at 95.0% confidence -1.11047e+08 +/- 7.36589e+07 -3.21017% +/- 2.12934% (Student's t, pooled s = 6.32415e+07) Order 4 zspages, on the other hand, do not show any statistically significant improvement in `mem_used_max` metrics. T-test for order 4: x order-2-maxmem + order-4-maxmem +-----------------------------------------------------------------------------+ |+ + + x x + + x + * x x| | |_______________________A___M________________A_|_____M_______| | +-----------------------------------------------------------------------------+ N Min Max Median Avg Stddev x 7 3.3364214e+09 3.5431588e+09 3.4990858e+09 3.4592294e+09 80073158 + 7 3.1713239e+09 3.4995978e+09 3.3763533e+09 3.3554221e+09 1.1609062e+08 No difference proven at 95.0% confidence Overall, with sufficient level of confidence order 3 zspages appear to be beneficial for these particular use-case and data patterns. Rather expectedly we also observed lower numbers of huge-pages when zsmalloc is configured with order 3 and order 4 zspages, for the reason already explained. 2) Synthetic test ----------------------------------------------------------------------------- Test untars linux-6.0.tar.xz and compiles the kernel. zram is configured as a block device with ext4 file system, lzo-rle compression algorithm. We captured /sys/block/zram0/mm_stat after every test and rebooted VM. orig_data_size mem_used_total mem_used_max pages_compacted compr_data_size mem_limit same_pages huge_pages ORDER 2 (BASE) 1691807744 628091753 655187968 0 655187968 59 0 34042 34043 1691803648 628089105 655159296 0 655159296 60 0 34043 34043 1691795456 628087429 655151104 0 655151104 59 0 34046 34046 1691799552 628093723 655216640 0 655216640 60 0 34044 34044 ORDER 3 1691787264 627781464 641740800 0 641740800 59 0 33591 33591 1691795456 627794239 641789952 0 641789952 59 0 33591 33591 1691811840 627788466 641691648 0 641691648 60 0 33591 33591 1691791360 627790682 641781760 0 641781760 59 0 33591 33591 ORDER 4 1691807744 627729506 639627264 0 639627264 59 0 33432 33432 1691820032 627731485 639606784 0 639606784 59 0 33432 33432 1691799552 627725753 639623168 0 639623168 59 0 33432 33433 1691820032 627734080 639746048 0 639746048 61 0 33432 33432 Order 3 and order 4 show statistically significant improvement in `mem_used_total` metrics. T-test for order 3: x order-2-usedmem-comp + order-3-usedmem-comp +-----------------------------------------------------------------------------+ |++ x| |++ x| |AM A| +-----------------------------------------------------------------------------+ N Min Max Median Avg Stddev x 4 6.551511e+08 6.5521664e+08 6.5518797e+08 6.5517875e+08 29795.878 + 4 6.4169165e+08 6.4178995e+08 6.4178176e+08 6.4175104e+08 45056 Difference at 95.0% confidence -1.34277e+07 +/- 66089.8 -2.04947% +/- 0.0100873% (Student's t, pooled s = 38195.8) T-test for order 4: x order-2-usedmem-comp + order-4-usedmem-comp +-----------------------------------------------------------------------------+ |+ x| |+ x| |++ x| |A| A| +-----------------------------------------------------------------------------+ N Min Max Median Avg Stddev x 4 6.551511e+08 6.5521664e+08 6.5518797e+08 6.5517875e+08 29795.878 + 4 6.3960678e+08 6.3974605e+08 6.3962726e+08 6.3965082e+08 64101.637 Difference at 95.0% confidence -1.55279e+07 +/- 86486.9 -2.37003% +/- 0.0132005% (Student's t, pooled s = 49984.1) Order 3 and order 4 show statistically significant improvement in `mem_used_max` metrics. T-test for order 3: x order-2-maxmem-comp + order-3-maxmem-comp +-----------------------------------------------------------------------------+ |++ x| |++ x| |AM A| +-----------------------------------------------------------------------------+ N Min Max Median Avg Stddev x 4 6.551511e+08 6.5521664e+08 6.5518797e+08 6.5517875e+08 29795.878 + 4 6.4169165e+08 6.4178995e+08 6.4178176e+08 6.4175104e+08 45056 Difference at 95.0% confidence -1.34277e+07 +/- 66089.8 -2.04947% +/- 0.0100873% (Student's t, pooled s = 38195.8) T-test for order 4: x order-2-maxmem-comp + order-4-maxmem-comp +-----------------------------------------------------------------------------+ |+ x| |+ x| |++ x| |A| A| +-----------------------------------------------------------------------------+ N Min Max Median Avg Stddev x 4 6.551511e+08 6.5521664e+08 6.5518797e+08 6.5517875e+08 29795.878 + 4 6.3960678e+08 6.3974605e+08 6.3962726e+08 6.3965082e+08 64101.637 Difference at 95.0% confidence -1.55279e+07 +/- 86486.9 -2.37003% +/- 0.0132005% (Student's t, pooled s = 49984.1) This test tends to benefit more from order 4 zspages, due to test's data patterns. Data patterns that generate a considerable number of badly compressible objects benefit from higher `huge_class_size` watermark, which is achieved with order 4 zspages. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sergey Senozhatsky <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Nitin Gupta <[email protected]> Signed-off-by: Andrew Morton <[email protected]>

zsmalloc has 255 size classes. Size classes contain a number of zspages, which store objects of the same size. zspage can consist of up to four physical pages. The exact (most optimal) zspage size is calculated for each size class during zsmalloc pool creation. As a reasonable optimization, zsmalloc merges size classes that have similar characteristics: number of pages per zspage and number of objects zspage can store. For example, let's look at the following size classes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable .. 94 1536 0 0 0 0 0 3 0 100 1632 0 0 0 0 0 2 0 .. Size classes torvalds#95-99 are merged with size class torvalds#100. That is, each time we store an object of size, say, 1568 bytes instead of using class torvalds#96 we end up storing it in size class torvalds#100. Class torvalds#100 is for objects of 1632 bytes in size, hence every 1568 bytes object wastes 1632-1568 bytes. Class torvalds#100 zspages consist of 2 physical pages and can hold 5 objects. When we need to store, say, 13 objects of size 1568 we end up allocating three zspages; in other words, 6 physical pages. However, if we'll look closer at size class torvalds#96 (which should hold objects of size 1568 bytes) and trace get_pages_per_zspage(): pages per zspage wasted bytes used% 1 960 76 2 352 95 3 1312 89 4 704 95 5 96 99 We'd notice that the most optimal zspage configuration for this class is when it consists of 5 physical pages, but currently we never let zspages to consists of more than 4 pages. A 5 page class torvalds#96 configuration would store 13 objects of size 1568 in a single zspage, allocating 5 physical pages, as opposed to 6 physical pages that class torvalds#100 will allocate. A higher order zspage for class torvalds#96 also changes its key characteristics: pages per-zspage and objects per-zspage. As a result classes torvalds#96 and torvalds#100 are not merged anymore, which gives us more compact zsmalloc. Of course the described effect does not apply only to size classes torvalds#96 and We still merge classes, but less often so. In other words classes are grouped in a more compact way, which decreases memory wastage: zspage order # unique size classes 2 69 3 123 4 191 Let's take a closer look at the bottom of /sys/kernel/debug/zsmalloc/zram0/classes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 202 3264 0 0 0 0 0 4 0 254 4096 0 0 0 0 0 1 0 ... For exactly same reason - maximum 4 pages per zspage - the last non-huge size class is torvalds#202, which stores objects of size 3264 bytes. Any object larger than 3264 bytes, hence, is considered to be huge and lands in size class torvalds#254, which uses a whole physical page to store every object. To put it slightly differently - objects in huge classes don't share physical pages. 3264 bytes is too low of a watermark and we have too many huge classes: classes from torvalds#203 to torvalds#254. Similarly to class size torvalds#96 above, higher order zspages change key characteristics for some of those huge size classes and thus those classes become normal classes, where stored objects share physical pages. Hence yet another consequence of higher order zspages: we move the huge size class watermark with higher order zspages, have less huge classes and store large objects in a more compact way. For order 3, huge class watermark becomes 3632 bytes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 202 3264 0 0 0 0 0 4 0 211 3408 0 0 0 0 0 5 0 217 3504 0 0 0 0 0 6 0 222 3584 0 0 0 0 0 7 0 225 3632 0 0 0 0 0 8 0 254 4096 0 0 0 0 0 1 0 ... For order 4, huge class watermark becomes 3840 bytes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 202 3264 0 0 0 0 0 4 0 206 3328 0 0 0 0 0 13 0 207 3344 0 0 0 0 0 9 0 208 3360 0 0 0 0 0 14 0 211 3408 0 0 0 0 0 5 0 212 3424 0 0 0 0 0 16 0 214 3456 0 0 0 0 0 11 0 217 3504 0 0 0 0 0 6 0 219 3536 0 0 0 0 0 13 0 222 3584 0 0 0 0 0 7 0 223 3600 0 0 0 0 0 15 0 225 3632 0 0 0 0 0 8 0 228 3680 0 0 0 0 0 9 0 230 3712 0 0 0 0 0 10 0 232 3744 0 0 0 0 0 11 0 234 3776 0 0 0 0 0 12 0 235 3792 0 0 0 0 0 13 0 236 3808 0 0 0 0 0 14 0 238 3840 0 0 0 0 0 15 0 254 4096 0 0 0 0 0 1 0 ... TESTS ===== 1) ChromeOS memory pressure test ============================================================================= Our standard memory pressure test, that is designed with reproducibility in mind. zram is configured as a swap device, lzo-rle compression algorithm. We captured /sys/block/zram0/mm_stat after every test and rebooted device. Columns per (Documentation/admin-guide/blockdev/zram.rst) orig_data_size mem_used_total mem_used_max pages_compacted compr_data_size mem_limit same_pages huge_pages ORDER 2 (BASE) zspage 10353639424 2981711944 3166896128 0 3543158784 579494 825135 123707 10168573952 2932288347 3106541568 0 3499085824 565187 853137 126153 9950461952 2815911234 3035693056 0 3441090560 586696 748054 122103 9892335616 2779566152 2943459328 0 3514736640 591541 650696 119621 9993949184 2814279212 3021357056 0 3336421376 582488 711744 121273 9953226752 2856382009 3025649664 0 3512893440 564559 787861 123034 9838448640 2785481728 2997575680 0 3367219200 573282 777099 122739 ORDER 3 zspage 9509138432 2706941227 2823393280 0 3389587456 535856 1011472 90223 10105245696 2882368370 3013095424 0 3296165888 563896 1059033 94808 9531236352 2666125512 2867650560 0 3396173824 567117 1126396 88807 9561812992 2714536764 2956652544 0 3310505984 548223 827322 90992 9807470592 2790315707 2908053504 0 3378315264 563670 1020933 93725 10178371584 2948838782 3071209472 0 3329548288 548533 954546 90730 9925165056 2849839413 2958274560 0 3336978432 551464 1058302 89381 ORDER 4 zspage 9444515840 2613362645 2668232704 0 3396759552 573735 1162207 83475 10129108992 2925888488 3038351360 0 3499597824 555634 1231542 84525 9876594688 2786692282 2897006592 0 3469463552 584835 1290535 84133 10012909568 2649711847 2801512448 0 3171323904 675405 750728 80424 10120966144 2866742402 2978639872 0 3257815040 587435 1093981 83587 9578790912 2671245225 2802270208 0 3376353280 545548 1047930 80895 10108588032 2888433523 2983960576 0 3316641792 571445 1290640 81402 First, we establish that order 3 and 4 don't cause any statistically significant change in `orig_data_size` (number of bytes we store during the test), in other words larger zspages don't cause regressions. T-test for order 3: x order-2-stored + order-3-stored +-----------------------------------------------------------------------------+ |+ + + + x x + x x + x+ x| | |________________________AM__|_________M_____A____|__________| | +-----------------------------------------------------------------------------+ N Min Max Median Avg Stddev x 7 9.8384486e+09 1.0353639e+10 9.9532268e+09 1.0021519e+10 1.7916718e+08 + 7 9.5091384e+09 1.0178372e+10 9.8074706e+09 9.8026344e+09 2.7856206e+08 No difference proven at 95.0% confidence T-test for order 4: x order-2-stored + order-4-stored +-----------------------------------------------------------------------------+ | + | |+ + x +x xx x + ++ x x| | |__________________|____A____M____M____________|_| | +-----------------------------------------------------------------------------+ N Min Max Median Avg Stddev x 7 9.8384486e+09 1.0353639e+10 9.9532268e+09 1.0021519e+10 1.7916718e+08 + 7 9.4445158e+09 1.0129109e+10 1.001291e+10 9.8959249e+09 2.7947784e+08 No difference proven at 95.0% confidence Next we establish that there is a statistically significant improvement in `mem_used_total` metrics. T-test for order 3: x order-2-usedmem + order-3-usedmem +-----------------------------------------------------------------------------+ |+ + + x ++ x + xx x + x x| | |_________________A__M__|____________|__A________________| | +-----------------------------------------------------------------------------+ N Min Max Median Avg Stddev x 7 2.9434593e+09 3.1668961e+09 3.0256497e+09 3.0424532e+09 73235062 + 7 2.8233933e+09 3.0712095e+09 2.9566525e+09 2.9426185e+09 84630851 Difference at 95.0% confidence -9.98347e+07 +/- 9.21744e+07 -3.28139% +/- 3.02961% (Student's t, pooled s = 7.91383e+07) T-test for order 4: x order-2-usedmem + order-4-usedmem +-----------------------------------------------------------------------------+ | + x | |+ + + x ++ x x * x x| | |__________________A__M__________|_____|_M__A__________| | +-----------------------------------------------------------------------------+ N Min Max Median Avg Stddev x 7 2.9434593e+09 3.1668961e+09 3.0256497e+09 3.0424532e+09 73235062 + 7 2.6682327e+09 3.0383514e+09 2.8970066e+09 2.8814248e+09 1.3098053e+08 Difference at 95.0% confidence -1.61028e+08 +/- 1.23591e+08 -5.29272% +/- 4.0622% (Student's t, pooled s = 1.06111e+08) Order 3 zspages also show statistically significant improvement in `mem_used_max` metrics. T-test for order 3: x order-2-maxmem + order-3-maxmem +-----------------------------------------------------------------------------+ |+ + + x+ x + + + x x x x| | |________M__A_________|_|_____________________A___________M____________| | +-----------------------------------------------------------------------------+ N Min Max Median Avg Stddev x 7 3.3364214e+09 3.5431588e+09 3.4990858e+09 3.4592294e+09 80073158 + 7 3.2961659e+09 3.3961738e+09 3.3369784e+09 3.3481822e+09 39840377 Difference at 95.0% confidence -1.11047e+08 +/- 7.36589e+07 -3.21017% +/- 2.12934% (Student's t, pooled s = 6.32415e+07) Order 4 zspages, on the other hand, do not show any statistically significant improvement in `mem_used_max` metrics. T-test for order 4: x order-2-maxmem + order-4-maxmem +-----------------------------------------------------------------------------+ |+ + + x x + + x + * x x| | |_______________________A___M________________A_|_____M_______| | +-----------------------------------------------------------------------------+ N Min Max Median Avg Stddev x 7 3.3364214e+09 3.5431588e+09 3.4990858e+09 3.4592294e+09 80073158 + 7 3.1713239e+09 3.4995978e+09 3.3763533e+09 3.3554221e+09 1.1609062e+08 No difference proven at 95.0% confidence Overall, with sufficient level of confidence, order 3 zspages appear to be beneficial for these particular use-case and data patterns. Rather expectedly we also observed lower numbers of huge-pages when zsmalloc is configured with order 3 and order 4 zspages, for the reason already explained. 2) Synthetic test ============================================================================= Test untars linux-6.0.tar.xz and compiles the kernel. zram is configured as a block device with ext4 file system, lzo-rle compression algorithm. We captured /sys/block/zram0/mm_stat after every test and rebooted the VM. orig_data_size mem_used_total mem_used_max pages_compacted compr_data_size mem_limit same_pages huge_pages ORDER 2 (BASE) zspage 1691791360 628086729 655171584 0 655171584 60 0 34043 1691787264 628089196 655175680 0 655175680 60 0 34046 1691803648 628098840 655187968 0 655187968 59 0 34047 1691795456 628091503 655183872 0 655183872 60 0 34044 1691799552 628086877 655183872 0 655183872 60 0 34047 ORDER 3 zspage 1691803648 627792993 641794048 0 641794048 60 0 33591 1691787264 627779342 641708032 0 641708032 59 0 33591 1691811840 627786616 641769472 0 641769472 60 0 33591 1691803648 627794468 641818624 0 641818624 59 0 33592 1691783168 627780882 641794048 0 641794048 61 0 33591 ORDER 4 zspage 1691803648 627726635 639655936 0 639655936 60 0 33435 1691811840 627733348 639643648 0 639643648 61 0 33434 1691795456 627726290 639614976 0 639614976 60 0 33435 1691803648 627730458 639688704 0 639688704 60 0 33434 1691811840 627727771 639688704 0 639688704 60 0 33434 Order 3 and order 4 show statistically significant improvement in `mem_used_max` metrics. T-test for order 3: x order-2-maxmem + order-3-maxmem +--------------------------------------------------------------------------+ |+ x| |+ x| |+ x| |++ x| |A| A| +--------------------------------------------------------------------------+ N Min Max Median Avg Stddev x 5 6.5517158e+08 6.5518797e+08 6.5518387e+08 6.551806e+08 6730.4157 + 5 6.4170803e+08 6.4181862e+08 6.4179405e+08 6.4177684e+08 42210.666 Difference at 95.0% confidence -1.34038e+07 +/- 44080.7 -2.04581% +/- 0.00672802% (Student's t, pooled s = 30224.5) T-test for order 4: x order-2-maxmem + order-4-maxmem +--------------------------------------------------------------------------+ |+ x| |+ x| |+ x| |+ x| |+ x| |A A| +--------------------------------------------------------------------------+ N Min Max Median Avg Stddev x 5 6.5517158e+08 6.5518797e+08 6.5518387e+08 6.551806e+08 6730.4157 + 5 6.3961498e+08 6.396887e+08 6.3965594e+08 6.3965839e+08 31408.602 Difference at 95.0% confidence -1.55222e+07 +/- 33126.2 -2.36915% +/- 0.00505604% (Student's t, pooled s = 22713.4) This test tends to benefit more from order 4 zspages, due to test's data patterns. zsmalloc object distribution analysis ============================================================================= Order 2 (4 pages per zspage) tends to put many objects in size class 2048, which is merged with size classes torvalds#112-torvalds#125: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 71 1168 0 0 6146 6146 1756 2 0 74 1216 0 1 4560 4552 1368 3 0 76 1248 0 1 2938 2934 904 4 0 83 1360 0 0 10971 10971 3657 1 0 91 1488 0 0 16126 16126 5864 4 0 94 1536 0 1 5912 5908 2217 3 0 100 1632 0 0 11990 11990 4796 2 0 107 1744 0 1 15771 15768 6759 3 0 111 1808 0 1 10386 10380 4616 4 0 126 2048 0 0 45444 45444 22722 1 0 144 2336 0 0 47446 47446 27112 4 0 151 2448 1 0 10760 10759 6456 3 0 168 2720 0 0 10173 10173 6782 2 0 190 3072 0 1 1700 1697 1275 3 0 202 3264 0 1 290 286 232 4 0 254 4096 0 0 34051 34051 34051 1 0 Order 3 (8 pages per zspage) changed pool characteristics and unmerged some of the size classes, which resulted in less objects being put into size class 2048, because there are lower size classes are now available for more compact object storage: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 71 1168 0 1 2996 2994 856 2 0 72 1184 0 1 1632 1609 476 7 0 73 1200 1 0 1445 1442 425 5 0 74 1216 0 0 1510 1510 453 3 0 75 1232 0 1 1495 1479 455 7 0 76 1248 0 1 1456 1451 448 4 0 78 1280 0 1 3040 3033 950 5 0 79 1296 0 1 1584 1571 504 7 0 83 1360 0 0 6375 6375 2125 1 0 84 1376 0 1 1817 1796 632 8 0 87 1424 0 1 6020 6006 2107 7 0 88 1440 0 1 2108 2101 744 6 0 89 1456 0 1 2072 2064 740 5 0 91 1488 0 1 4169 4159 1516 4 0 92 1504 0 1 2014 2007 742 7 0 94 1536 0 1 3904 3900 1464 3 0 95 1552 0 1 1890 1873 720 8 0 96 1568 0 1 1963 1958 755 5 0 97 1584 0 1 1980 1974 770 7 0 100 1632 0 1 6190 6187 2476 2 0 103 1680 0 0 6477 6477 2667 7 0 104 1696 0 1 2256 2253 940 5 0 105 1712 0 1 2356 2340 992 8 0 107 1744 1 0 4697 4696 2013 3 0 110 1792 0 1 7744 7734 3388 7 0 111 1808 0 1 2655 2649 1180 4 0 114 1856 0 1 8371 8365 3805 5 0 116 1888 1 0 5863 5862 2706 6 0 117 1904 0 1 2955 2942 1379 7 0 118 1920 0 1 3009 2997 1416 8 0 126 2048 0 0 25276 25276 12638 1 0 128 2080 0 1 6060 6052 3232 8 0 129 2096 1 0 3081 3080 1659 7 0 134 2176 0 1 14835 14830 7912 8 0 135 2192 0 1 2769 2758 1491 7 0 137 2224 0 1 5082 5077 2772 6 0 140 2272 0 1 7236 7232 4020 5 0 144 2336 0 1 8428 8423 4816 4 0 147 2384 0 1 5316 5313 3101 7 0 151 2448 0 1 5445 5443 3267 3 0 155 2512 0 0 4121 4121 2536 8 0 158 2560 0 1 2208 2205 1380 5 0 160 2592 0 0 1133 1133 721 7 0 168 2720 0 0 2712 2712 1808 2 0 177 2864 1 0 1100 1098 770 7 0 180 2912 0 1 189 183 135 5 0 184 2976 0 1 176 166 128 8 0 190 3072 0 0 252 252 189 3 0 197 3184 0 1 198 192 154 7 0 202 3264 0 1 100 96 80 4 0 211 3408 0 1 210 208 175 5 0 217 3504 0 1 98 94 84 6 0 222 3584 0 0 104 104 91 7 0 225 3632 0 1 54 50 48 8 0 254 4096 0 0 33591 33591 33591 1 0 Note, the huge size watermark is above 3632 and there are a number of new normal classes available that previously were merged with the huge class. For instance, size class torvalds#211 holds 210 objects of size 3408 and uses 175 physical pages, while previously for those objects we would have used 210 physical pages. Signed-off-by: Sergey Senozhatsky <[email protected]>

zsmalloc has 255 size classes. Size classes contain a number of zspages, which store objects of the same size. zspage can consist of up to four physical pages. The exact (most optimal) zspage size is calculated for each size class during zsmalloc pool creation. As a reasonable optimization, zsmalloc merges size classes that have similar characteristics: number of pages per zspage and number of objects zspage can store. For example, let's look at the following size classes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable .. 94 1536 0 0 0 0 0 3 0 100 1632 0 0 0 0 0 2 0 .. Size classes torvalds#95-99 are merged with size class torvalds#100. That is, each time we store an object of size, say, 1568 bytes instead of using class torvalds#96 we end up storing it in size class torvalds#100. Class torvalds#100 is for objects of 1632 bytes in size, hence every 1568 bytes object wastes 1632-1568 bytes. Class torvalds#100 zspages consist of 2 physical pages and can hold 5 objects. When we need to store, say, 13 objects of size 1568 we end up allocating three zspages; in other words, 6 physical pages. However, if we'll look closer at size class torvalds#96 (which should hold objects of size 1568 bytes) and trace get_pages_per_zspage(): pages per zspage wasted bytes used% 1 960 76 2 352 95 3 1312 89 4 704 95 5 96 99 We'd notice that the most optimal zspage configuration for this class is when it consists of 5 physical pages, but currently we never let zspages to consists of more than 4 pages. A 5 page class torvalds#96 configuration would store 13 objects of size 1568 in a single zspage, allocating 5 physical pages, as opposed to 6 physical pages that class torvalds#100 will allocate. A higher order zspage for class torvalds#96 also changes its key characteristics: pages per-zspage and objects per-zspage. As a result classes torvalds#96 and torvalds#100 are not merged anymore, which gives us more compact zsmalloc. Of course the described effect does not apply only to size classes torvalds#96 and We still merge classes, but less often so. In other words classes are grouped in a more compact way, which decreases memory wastage: zspage order # unique size classes 2 69 3 123 4 191 Let's take a closer look at the bottom of /sys/kernel/debug/zsmalloc/zram0/classes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 202 3264 0 0 0 0 0 4 0 254 4096 0 0 0 0 0 1 0 ... For exactly same reason - maximum 4 pages per zspage - the last non-huge size class is torvalds#202, which stores objects of size 3264 bytes. Any object larger than 3264 bytes, hence, is considered to be huge and lands in size class torvalds#254, which uses a whole physical page to store every object. To put it slightly differently - objects in huge classes don't share physical pages. 3264 bytes is too low of a watermark and we have too many huge classes: classes from torvalds#203 to torvalds#254. Similarly to class size torvalds#96 above, higher order zspages change key characteristics for some of those huge size classes and thus those classes become normal classes, where stored objects share physical pages. Hence yet another consequence of higher order zspages: we move the huge size class watermark with higher order zspages, have less huge classes and store large objects in a more compact way. For order 3, huge class watermark becomes 3632 bytes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 202 3264 0 0 0 0 0 4 0 211 3408 0 0 0 0 0 5 0 217 3504 0 0 0 0 0 6 0 222 3584 0 0 0 0 0 7 0 225 3632 0 0 0 0 0 8 0 254 4096 0 0 0 0 0 1 0 ... For order 4, huge class watermark becomes 3840 bytes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 202 3264 0 0 0 0 0 4 0 206 3328 0 0 0 0 0 13 0 207 3344 0 0 0 0 0 9 0 208 3360 0 0 0 0 0 14 0 211 3408 0 0 0 0 0 5 0 212 3424 0 0 0 0 0 16 0 214 3456 0 0 0 0 0 11 0 217 3504 0 0 0 0 0 6 0 219 3536 0 0 0 0 0 13 0 222 3584 0 0 0 0 0 7 0 223 3600 0 0 0 0 0 15 0 225 3632 0 0 0 0 0 8 0 228 3680 0 0 0 0 0 9 0 230 3712 0 0 0 0 0 10 0 232 3744 0 0 0 0 0 11 0 234 3776 0 0 0 0 0 12 0 235 3792 0 0 0 0 0 13 0 236 3808 0 0 0 0 0 14 0 238 3840 0 0 0 0 0 15 0 254 4096 0 0 0 0 0 1 0 ... TESTS ===== Test untars linux-6.0.tar.xz and compiles the kernel. zram is configured as a block device with ext4 file system, lzo-rle compression algorithm. We captured /sys/block/zram0/mm_stat after every test and rebooted the VM. orig_data_size mem_used_total mem_used_max pages_compacted compr_data_size mem_limit same_pages huge_pages ORDER 2 (BASE) zspage 1691791360 628086729 655171584 0 655171584 60 0 34043 1691787264 628089196 655175680 0 655175680 60 0 34046 1691803648 628098840 655187968 0 655187968 59 0 34047 1691795456 628091503 655183872 0 655183872 60 0 34044 1691799552 628086877 655183872 0 655183872 60 0 34047 ORDER 3 zspage 1691803648 627792993 641794048 0 641794048 60 0 33591 1691787264 627779342 641708032 0 641708032 59 0 33591 1691811840 627786616 641769472 0 641769472 60 0 33591 1691803648 627794468 641818624 0 641818624 59 0 33592 1691783168 627780882 641794048 0 641794048 61 0 33591 ORDER 4 zspage 1691803648 627726635 639655936 0 639655936 60 0 33435 1691811840 627733348 639643648 0 639643648 61 0 33434 1691795456 627726290 639614976 0 639614976 60 0 33435 1691803648 627730458 639688704 0 639688704 60 0 33434 1691811840 627727771 639688704 0 639688704 60 0 33434 Order 3 and order 4 show statistically significant improvement in `mem_used_max` metrics. T-test for order 3: x order-2-maxmem + order-3-maxmem N Min Max Median Avg Stddev x 5 6.5517158e+08 6.5518797e+08 6.5518387e+08 6.551806e+08 6730.4157 + 5 6.4170803e+08 6.4181862e+08 6.4179405e+08 6.4177684e+08 42210.666 Difference at 95.0% confidence -1.34038e+07 +/- 44080.7 -2.04581% +/- 0.00672802% (Student's t, pooled s = 30224.5) T-test for order 4: x order-2-maxmem + order-4-maxmem N Min Max Median Avg Stddev x 5 6.5517158e+08 6.5518797e+08 6.5518387e+08 6.551806e+08 6730.4157 + 5 6.3961498e+08 6.396887e+08 6.3965594e+08 6.3965839e+08 31408.602 Difference at 95.0% confidence -1.55222e+07 +/- 33126.2 -2.36915% +/- 0.00505604% (Student's t, pooled s = 22713.4) This test tends to benefit more from order 4 zspages, due to test's data patterns. zsmalloc object distribution analysis ============================================================================= Order 2 (4 pages per zspage) tends to put many objects in size class 2048, which is merged with size classes torvalds#112-torvalds#125: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 71 1168 0 0 6146 6146 1756 2 0 74 1216 0 1 4560 4552 1368 3 0 76 1248 0 1 2938 2934 904 4 0 83 1360 0 0 10971 10971 3657 1 0 91 1488 0 0 16126 16126 5864 4 0 94 1536 0 1 5912 5908 2217 3 0 100 1632 0 0 11990 11990 4796 2 0 107 1744 0 1 15771 15768 6759 3 0 111 1808 0 1 10386 10380 4616 4 0 126 2048 0 0 45444 45444 22722 1 0 144 2336 0 0 47446 47446 27112 4 0 151 2448 1 0 10760 10759 6456 3 0 168 2720 0 0 10173 10173 6782 2 0 190 3072 0 1 1700 1697 1275 3 0 202 3264 0 1 290 286 232 4 0 254 4096 0 0 34051 34051 34051 1 0 Order 3 (8 pages per zspage) changed pool characteristics and unmerged some of the size classes, which resulted in less objects being put into size class 2048, because there are lower size classes are now available for more compact object storage: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 71 1168 0 1 2996 2994 856 2 0 72 1184 0 1 1632 1609 476 7 0 73 1200 1 0 1445 1442 425 5 0 74 1216 0 0 1510 1510 453 3 0 75 1232 0 1 1495 1479 455 7 0 76 1248 0 1 1456 1451 448 4 0 78 1280 0 1 3040 3033 950 5 0 79 1296 0 1 1584 1571 504 7 0 83 1360 0 0 6375 6375 2125 1 0 84 1376 0 1 1817 1796 632 8 0 87 1424 0 1 6020 6006 2107 7 0 88 1440 0 1 2108 2101 744 6 0 89 1456 0 1 2072 2064 740 5 0 91 1488 0 1 4169 4159 1516 4 0 92 1504 0 1 2014 2007 742 7 0 94 1536 0 1 3904 3900 1464 3 0 95 1552 0 1 1890 1873 720 8 0 96 1568 0 1 1963 1958 755 5 0 97 1584 0 1 1980 1974 770 7 0 100 1632 0 1 6190 6187 2476 2 0 103 1680 0 0 6477 6477 2667 7 0 104 1696 0 1 2256 2253 940 5 0 105 1712 0 1 2356 2340 992 8 0 107 1744 1 0 4697 4696 2013 3 0 110 1792 0 1 7744 7734 3388 7 0 111 1808 0 1 2655 2649 1180 4 0 114 1856 0 1 8371 8365 3805 5 0 116 1888 1 0 5863 5862 2706 6 0 117 1904 0 1 2955 2942 1379 7 0 118 1920 0 1 3009 2997 1416 8 0 126 2048 0 0 25276 25276 12638 1 0 128 2080 0 1 6060 6052 3232 8 0 129 2096 1 0 3081 3080 1659 7 0 134 2176 0 1 14835 14830 7912 8 0 135 2192 0 1 2769 2758 1491 7 0 137 2224 0 1 5082 5077 2772 6 0 140 2272 0 1 7236 7232 4020 5 0 144 2336 0 1 8428 8423 4816 4 0 147 2384 0 1 5316 5313 3101 7 0 151 2448 0 1 5445 5443 3267 3 0 155 2512 0 0 4121 4121 2536 8 0 158 2560 0 1 2208 2205 1380 5 0 160 2592 0 0 1133 1133 721 7 0 168 2720 0 0 2712 2712 1808 2 0 177 2864 1 0 1100 1098 770 7 0 180 2912 0 1 189 183 135 5 0 184 2976 0 1 176 166 128 8 0 190 3072 0 0 252 252 189 3 0 197 3184 0 1 198 192 154 7 0 202 3264 0 1 100 96 80 4 0 211 3408 0 1 210 208 175 5 0 217 3504 0 1 98 94 84 6 0 222 3584 0 0 104 104 91 7 0 225 3632 0 1 54 50 48 8 0 254 4096 0 0 33591 33591 33591 1 0 Note, the huge size watermark is above 3632 and there are a number of new normal classes available that previously were merged with the huge class. For instance, size class torvalds#211 holds 210 objects of size 3408 and uses 175 physical pages, while previously for those objects we would have used 210 physical pages. Signed-off-by: Sergey Senozhatsky <[email protected]>

zsmalloc has 255 size classes. Size classes contain a number of zspages, which store objects of the same size. zspage can consist of up to four physical pages. The exact (most optimal) zspage size is calculated for each size class during zsmalloc pool creation. As a reasonable optimization, zsmalloc merges size classes that have similar characteristics: number of pages per zspage and number of objects zspage can store. For example, let's look at the following size classes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable .. 94 1536 0 0 0 0 0 3 0 100 1632 0 0 0 0 0 2 0 .. Size classes torvalds#95-99 are merged with size class torvalds#100. That is, each time we store an object of size, say, 1568 bytes instead of using class torvalds#96 we end up storing it in size class torvalds#100. Class torvalds#100 is for objects of 1632 bytes in size, hence every 1568 bytes object wastes 1632-1568 bytes. Class torvalds#100 zspages consist of 2 physical pages and can hold 5 objects. When we need to store, say, 13 objects of size 1568 we end up allocating three zspages; in other words, 6 physical pages. However, if we'll look closer at size class torvalds#96 (which should hold objects of size 1568 bytes) and trace get_pages_per_zspage(): pages per zspage wasted bytes used% 1 960 76 2 352 95 3 1312 89 4 704 95 5 96 99 We'd notice that the most optimal zspage configuration for this class is when it consists of 5 physical pages, but currently we never let zspages to consists of more than 4 pages. A 5 page class torvalds#96 configuration would store 13 objects of size 1568 in a single zspage, allocating 5 physical pages, as opposed to 6 physical pages that class torvalds#100 will allocate. A higher order zspage for class torvalds#96 also changes its key characteristics: pages per-zspage and objects per-zspage. As a result classes torvalds#96 and torvalds#100 are not merged anymore, which gives us more compact zsmalloc. Of course the described effect does not apply only to size classes torvalds#96 and We still merge classes, but less often so. In other words classes are grouped in a more compact way, which decreases memory wastage: zspage order # unique size classes 2 69 3 123 4 191 Let's take a closer look at the bottom of /sys/kernel/debug/zsmalloc/zram0/classes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 202 3264 0 0 0 0 0 4 0 254 4096 0 0 0 0 0 1 0 ... For exactly same reason - maximum 4 pages per zspage - the last non-huge size class is torvalds#202, which stores objects of size 3264 bytes. Any object larger than 3264 bytes, hence, is considered to be huge and lands in size class torvalds#254, which uses a whole physical page to store every object. To put it slightly differently - objects in huge classes don't share physical pages. 3264 bytes is too low of a watermark and we have too many huge classes: classes from torvalds#203 to torvalds#254. Similarly to class size torvalds#96 above, higher order zspages change key characteristics for some of those huge size classes and thus those classes become normal classes, where stored objects share physical pages. Hence yet another consequence of higher order zspages: we move the huge size class watermark with higher order zspages, have less huge classes and store large objects in a more compact way. For order 3, huge class watermark becomes 3632 bytes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 202 3264 0 0 0 0 0 4 0 211 3408 0 0 0 0 0 5 0 217 3504 0 0 0 0 0 6 0 222 3584 0 0 0 0 0 7 0 225 3632 0 0 0 0 0 8 0 254 4096 0 0 0 0 0 1 0 ... For order 4, huge class watermark becomes 3840 bytes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 202 3264 0 0 0 0 0 4 0 206 3328 0 0 0 0 0 13 0 207 3344 0 0 0 0 0 9 0 208 3360 0 0 0 0 0 14 0 211 3408 0 0 0 0 0 5 0 212 3424 0 0 0 0 0 16 0 214 3456 0 0 0 0 0 11 0 217 3504 0 0 0 0 0 6 0 219 3536 0 0 0 0 0 13 0 222 3584 0 0 0 0 0 7 0 223 3600 0 0 0 0 0 15 0 225 3632 0 0 0 0 0 8 0 228 3680 0 0 0 0 0 9 0 230 3712 0 0 0 0 0 10 0 232 3744 0 0 0 0 0 11 0 234 3776 0 0 0 0 0 12 0 235 3792 0 0 0 0 0 13 0 236 3808 0 0 0 0 0 14 0 238 3840 0 0 0 0 0 15 0 254 4096 0 0 0 0 0 1 0 ... TESTS ===== Test untars linux-6.0.tar.xz and compiles the kernel. zram is configured as a block device with ext4 file system, lzo-rle compression algorithm. We captured /sys/block/zram0/mm_stat after every test and rebooted the VM. orig_data_size mem_used_total mem_used_max pages_compacted compr_data_size mem_limit same_pages huge_pages ORDER 2 (BASE) zspage 1691791360 628086729 655171584 0 655171584 60 0 34043 1691787264 628089196 655175680 0 655175680 60 0 34046 1691803648 628098840 655187968 0 655187968 59 0 34047 1691795456 628091503 655183872 0 655183872 60 0 34044 1691799552 628086877 655183872 0 655183872 60 0 34047 ORDER 3 zspage 1691803648 627792993 641794048 0 641794048 60 0 33591 1691787264 627779342 641708032 0 641708032 59 0 33591 1691811840 627786616 641769472 0 641769472 60 0 33591 1691803648 627794468 641818624 0 641818624 59 0 33592 1691783168 627780882 641794048 0 641794048 61 0 33591 ORDER 4 zspage 1691803648 627726635 639655936 0 639655936 60 0 33435 1691811840 627733348 639643648 0 639643648 61 0 33434 1691795456 627726290 639614976 0 639614976 60 0 33435 1691803648 627730458 639688704 0 639688704 60 0 33434 1691811840 627727771 639688704 0 639688704 60 0 33434 Order 3 and order 4 show statistically significant improvement in `mem_used_max` metrics. T-test for order 3: x order-2-maxmem + order-3-maxmem N Min Max Median Avg Stddev x 5 6.5517158e+08 6.5518797e+08 6.5518387e+08 6.551806e+08 6730.4157 + 5 6.4170803e+08 6.4181862e+08 6.4179405e+08 6.4177684e+08 42210.666 Difference at 95.0% confidence -1.34038e+07 +/- 44080.7 -2.04581% +/- 0.00672802% (Student's t, pooled s = 30224.5) T-test for order 4: x order-2-maxmem + order-4-maxmem N Min Max Median Avg Stddev x 5 6.5517158e+08 6.5518797e+08 6.5518387e+08 6.551806e+08 6730.4157 + 5 6.3961498e+08 6.396887e+08 6.3965594e+08 6.3965839e+08 31408.602 Difference at 95.0% confidence -1.55222e+07 +/- 33126.2 -2.36915% +/- 0.00505604% (Student's t, pooled s = 22713.4) This test tends to benefit more from order 4 zspages, due to test's data patterns. zsmalloc object distribution analysis ============================================================================= Order 2 (4 pages per zspage) tends to put many objects in size class 2048, which is merged with size classes torvalds#112-torvalds#125: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 71 1168 0 0 6146 6146 1756 2 0 74 1216 0 1 4560 4552 1368 3 0 76 1248 0 1 2938 2934 904 4 0 83 1360 0 0 10971 10971 3657 1 0 91 1488 0 0 16126 16126 5864 4 0 94 1536 0 1 5912 5908 2217 3 0 100 1632 0 0 11990 11990 4796 2 0 107 1744 0 1 15771 15768 6759 3 0 111 1808 0 1 10386 10380 4616 4 0 126 2048 0 0 45444 45444 22722 1 0 144 2336 0 0 47446 47446 27112 4 0 151 2448 1 0 10760 10759 6456 3 0 168 2720 0 0 10173 10173 6782 2 0 190 3072 0 1 1700 1697 1275 3 0 202 3264 0 1 290 286 232 4 0 254 4096 0 0 34051 34051 34051 1 0 Order 3 (8 pages per zspage) changed pool characteristics and unmerged some of the size classes, which resulted in less objects being put into size class 2048, because there are lower size classes are now available for more compact object storage: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 71 1168 0 1 2996 2994 856 2 0 72 1184 0 1 1632 1609 476 7 0 73 1200 1 0 1445 1442 425 5 0 74 1216 0 0 1510 1510 453 3 0 75 1232 0 1 1495 1479 455 7 0 76 1248 0 1 1456 1451 448 4 0 78 1280 0 1 3040 3033 950 5 0 79 1296 0 1 1584 1571 504 7 0 83 1360 0 0 6375 6375 2125 1 0 84 1376 0 1 1817 1796 632 8 0 87 1424 0 1 6020 6006 2107 7 0 88 1440 0 1 2108 2101 744 6 0 89 1456 0 1 2072 2064 740 5 0 91 1488 0 1 4169 4159 1516 4 0 92 1504 0 1 2014 2007 742 7 0 94 1536 0 1 3904 3900 1464 3 0 95 1552 0 1 1890 1873 720 8 0 96 1568 0 1 1963 1958 755 5 0 97 1584 0 1 1980 1974 770 7 0 100 1632 0 1 6190 6187 2476 2 0 103 1680 0 0 6477 6477 2667 7 0 104 1696 0 1 2256 2253 940 5 0 105 1712 0 1 2356 2340 992 8 0 107 1744 1 0 4697 4696 2013 3 0 110 1792 0 1 7744 7734 3388 7 0 111 1808 0 1 2655 2649 1180 4 0 114 1856 0 1 8371 8365 3805 5 0 116 1888 1 0 5863 5862 2706 6 0 117 1904 0 1 2955 2942 1379 7 0 118 1920 0 1 3009 2997 1416 8 0 126 2048 0 0 25276 25276 12638 1 0 128 2080 0 1 6060 6052 3232 8 0 129 2096 1 0 3081 3080 1659 7 0 134 2176 0 1 14835 14830 7912 8 0 135 2192 0 1 2769 2758 1491 7 0 137 2224 0 1 5082 5077 2772 6 0 140 2272 0 1 7236 7232 4020 5 0 144 2336 0 1 8428 8423 4816 4 0 147 2384 0 1 5316 5313 3101 7 0 151 2448 0 1 5445 5443 3267 3 0 155 2512 0 0 4121 4121 2536 8 0 158 2560 0 1 2208 2205 1380 5 0 160 2592 0 0 1133 1133 721 7 0 168 2720 0 0 2712 2712 1808 2 0 177 2864 1 0 1100 1098 770 7 0 180 2912 0 1 189 183 135 5 0 184 2976 0 1 176 166 128 8 0 190 3072 0 0 252 252 189 3 0 197 3184 0 1 198 192 154 7 0 202 3264 0 1 100 96 80 4 0 211 3408 0 1 210 208 175 5 0 217 3504 0 1 98 94 84 6 0 222 3584 0 0 104 104 91 7 0 225 3632 0 1 54 50 48 8 0 254 4096 0 0 33591 33591 33591 1 0 Note, the huge size watermark is above 3632 and there are a number of new normal classes available that previously were merged with the huge class. For instance, size class torvalds#211 holds 210 objects of size 3408 and uses 175 physical pages, while previously for those objects we would have used 210 physical pages. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sergey Senozhatsky <[email protected]> Cc: Alexey Romanov <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Nitin Gupta <[email protected]> Signed-off-by: Andrew Morton <[email protected]>

driver defect clean up: torvalds#40 torvalds#41 torvalds#99 torvalds#100 torvalds#395 torvalds#396 torvalds#475 torvalds#614 torvalds#669 Change-Id: I581aaa8a1b950278bbf74d0c94aa647de89e07a9 Signed-off-by: Evoke Zhang <[email protected]>

zsmalloc has 255 size classes. Size classes contain a number of zspages, which store objects of the same size. zspage can consist of up to four physical pages. The exact (most optimal) zspage size is calculated for each size class during zsmalloc pool creation. As a reasonable optimization, zsmalloc merges size classes that have similar characteristics: number of pages per zspage and number of objects zspage can store. For example, let's look at the following size classes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable .. 94 1536 0 0 0 0 0 3 0 100 1632 0 0 0 0 0 2 0 .. Size classes torvalds#95-99 are merged with size class torvalds#100. That is, each time we store an object of size, say, 1568 bytes instead of using class torvalds#96 we end up storing it in size class torvalds#100. Class torvalds#100 is for objects of 1632 bytes in size, hence every 1568 bytes object wastes 1632-1568 bytes. Class torvalds#100 zspages consist of 2 physical pages and can hold 5 objects. When we need to store, say, 13 objects of size 1568 we end up allocating three zspages; in other words, 6 physical pages. However, if we'll look closer at size class torvalds#96 (which should hold objects of size 1568 bytes) and trace get_pages_per_zspage(): pages per zspage wasted bytes used% 1 960 76 2 352 95 3 1312 89 4 704 95 5 96 99 We'd notice that the most optimal zspage configuration for this class is when it consists of 5 physical pages, but currently we never let zspages to consists of more than 4 pages. A 5 page class torvalds#96 configuration would store 13 objects of size 1568 in a single zspage, allocating 5 physical pages, as opposed to 6 physical pages that class torvalds#100 will allocate. A higher order zspage for class torvalds#96 also changes its key characteristics: pages per-zspage and objects per-zspage. As a result classes torvalds#96 and torvalds#100 are not merged anymore, which gives us more compact zsmalloc. Of course the described effect does not apply only to size classes torvalds#96 and We still merge classes, but less often so. In other words classes are grouped in a more compact way, which decreases memory wastage: zspage order # unique size classes 2 69 3 123 4 191 Let's take a closer look at the bottom of /sys/kernel/debug/zsmalloc/zram0/classes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 202 3264 0 0 0 0 0 4 0 254 4096 0 0 0 0 0 1 0 ... For exactly same reason - maximum 4 pages per zspage - the last non-huge size class is torvalds#202, which stores objects of size 3264 bytes. Any object larger than 3264 bytes, hence, is considered to be huge and lands in size class torvalds#254, which uses a whole physical page to store every object. To put it slightly differently - objects in huge classes don't share physical pages. 3264 bytes is too low of a watermark and we have too many huge classes: classes from torvalds#203 to torvalds#254. Similarly to class size torvalds#96 above, higher order zspages change key characteristics for some of those huge size classes and thus those classes become normal classes, where stored objects share physical pages. Hence yet another consequence of higher order zspages: we move the huge size class watermark with higher order zspages, have less huge classes and store large objects in a more compact way. For order 3, huge class watermark becomes 3632 bytes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 202 3264 0 0 0 0 0 4 0 211 3408 0 0 0 0 0 5 0 217 3504 0 0 0 0 0 6 0 222 3584 0 0 0 0 0 7 0 225 3632 0 0 0 0 0 8 0 254 4096 0 0 0 0 0 1 0 ... For order 4, huge class watermark becomes 3840 bytes: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 202 3264 0 0 0 0 0 4 0 206 3328 0 0 0 0 0 13 0 207 3344 0 0 0 0 0 9 0 208 3360 0 0 0 0 0 14 0 211 3408 0 0 0 0 0 5 0 212 3424 0 0 0 0 0 16 0 214 3456 0 0 0 0 0 11 0 217 3504 0 0 0 0 0 6 0 219 3536 0 0 0 0 0 13 0 222 3584 0 0 0 0 0 7 0 223 3600 0 0 0 0 0 15 0 225 3632 0 0 0 0 0 8 0 228 3680 0 0 0 0 0 9 0 230 3712 0 0 0 0 0 10 0 232 3744 0 0 0 0 0 11 0 234 3776 0 0 0 0 0 12 0 235 3792 0 0 0 0 0 13 0 236 3808 0 0 0 0 0 14 0 238 3840 0 0 0 0 0 15 0 254 4096 0 0 0 0 0 1 0 ... TESTS ===== Test untars linux-6.0.tar.xz and compiles the kernel. zram is configured as a block device with ext4 file system, lzo-rle compression algorithm. We captured /sys/block/zram0/mm_stat after every test and rebooted the VM. orig_data_size mem_used_total mem_used_max pages_compacted compr_data_size mem_limit same_pages huge_pages ORDER 2 (BASE) zspage 1691791360 628086729 655171584 0 655171584 60 0 34043 1691787264 628089196 655175680 0 655175680 60 0 34046 1691803648 628098840 655187968 0 655187968 59 0 34047 1691795456 628091503 655183872 0 655183872 60 0 34044 1691799552 628086877 655183872 0 655183872 60 0 34047 ORDER 3 zspage 1691803648 627792993 641794048 0 641794048 60 0 33591 1691787264 627779342 641708032 0 641708032 59 0 33591 1691811840 627786616 641769472 0 641769472 60 0 33591 1691803648 627794468 641818624 0 641818624 59 0 33592 1691783168 627780882 641794048 0 641794048 61 0 33591 ORDER 4 zspage 1691803648 627726635 639655936 0 639655936 60 0 33435 1691811840 627733348 639643648 0 639643648 61 0 33434 1691795456 627726290 639614976 0 639614976 60 0 33435 1691803648 627730458 639688704 0 639688704 60 0 33434 1691811840 627727771 639688704 0 639688704 60 0 33434 Order 3 and order 4 show statistically significant improvement in `mem_used_max` metrics. T-test for order 3: x order-2-maxmem + order-3-maxmem N Min Max Median Avg Stddev x 5 6.5517158e+08 6.5518797e+08 6.5518387e+08 6.551806e+08 6730.4157 + 5 6.4170803e+08 6.4181862e+08 6.4179405e+08 6.4177684e+08 42210.666 Difference at 95.0% confidence -1.34038e+07 +/- 44080.7 -2.04581% +/- 0.00672802% (Student's t, pooled s = 30224.5) T-test for order 4: x order-2-maxmem + order-4-maxmem N Min Max Median Avg Stddev x 5 6.5517158e+08 6.5518797e+08 6.5518387e+08 6.551806e+08 6730.4157 + 5 6.3961498e+08 6.396887e+08 6.3965594e+08 6.3965839e+08 31408.602 Difference at 95.0% confidence -1.55222e+07 +/- 33126.2 -2.36915% +/- 0.00505604% (Student's t, pooled s = 22713.4) This test tends to benefit more from order 4 zspages, due to test's data patterns. zsmalloc object distribution analysis ============================================================================= Order 2 (4 pages per zspage) tends to put many objects in size class 2048, which is merged with size classes torvalds#112-torvalds#125: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 71 1168 0 0 6146 6146 1756 2 0 74 1216 0 1 4560 4552 1368 3 0 76 1248 0 1 2938 2934 904 4 0 83 1360 0 0 10971 10971 3657 1 0 91 1488 0 0 16126 16126 5864 4 0 94 1536 0 1 5912 5908 2217 3 0 100 1632 0 0 11990 11990 4796 2 0 107 1744 0 1 15771 15768 6759 3 0 111 1808 0 1 10386 10380 4616 4 0 126 2048 0 0 45444 45444 22722 1 0 144 2336 0 0 47446 47446 27112 4 0 151 2448 1 0 10760 10759 6456 3 0 168 2720 0 0 10173 10173 6782 2 0 190 3072 0 1 1700 1697 1275 3 0 202 3264 0 1 290 286 232 4 0 254 4096 0 0 34051 34051 34051 1 0 Order 3 (8 pages per zspage) changed pool characteristics and unmerged some of the size classes, which resulted in less objects being put into size class 2048, because there are lower size classes are now available for more compact object storage: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 71 1168 0 1 2996 2994 856 2 0 72 1184 0 1 1632 1609 476 7 0 73 1200 1 0 1445 1442 425 5 0 74 1216 0 0 1510 1510 453 3 0 75 1232 0 1 1495 1479 455 7 0 76 1248 0 1 1456 1451 448 4 0 78 1280 0 1 3040 3033 950 5 0 79 1296 0 1 1584 1571 504 7 0 83 1360 0 0 6375 6375 2125 1 0 84 1376 0 1 1817 1796 632 8 0 87 1424 0 1 6020 6006 2107 7 0 88 1440 0 1 2108 2101 744 6 0 89 1456 0 1 2072 2064 740 5 0 91 1488 0 1 4169 4159 1516 4 0 92 1504 0 1 2014 2007 742 7 0 94 1536 0 1 3904 3900 1464 3 0 95 1552 0 1 1890 1873 720 8 0 96 1568 0 1 1963 1958 755 5 0 97 1584 0 1 1980 1974 770 7 0 100 1632 0 1 6190 6187 2476 2 0 103 1680 0 0 6477 6477 2667 7 0 104 1696 0 1 2256 2253 940 5 0 105 1712 0 1 2356 2340 992 8 0 107 1744 1 0 4697 4696 2013 3 0 110 1792 0 1 7744 7734 3388 7 0 111 1808 0 1 2655 2649 1180 4 0 114 1856 0 1 8371 8365 3805 5 0 116 1888 1 0 5863 5862 2706 6 0 117 1904 0 1 2955 2942 1379 7 0 118 1920 0 1 3009 2997 1416 8 0 126 2048 0 0 25276 25276 12638 1 0 128 2080 0 1 6060 6052 3232 8 0 129 2096 1 0 3081 3080 1659 7 0 134 2176 0 1 14835 14830 7912 8 0 135 2192 0 1 2769 2758 1491 7 0 137 2224 0 1 5082 5077 2772 6 0 140 2272 0 1 7236 7232 4020 5 0 144 2336 0 1 8428 8423 4816 4 0 147 2384 0 1 5316 5313 3101 7 0 151 2448 0 1 5445 5443 3267 3 0 155 2512 0 0 4121 4121 2536 8 0 158 2560 0 1 2208 2205 1380 5 0 160 2592 0 0 1133 1133 721 7 0 168 2720 0 0 2712 2712 1808 2 0 177 2864 1 0 1100 1098 770 7 0 180 2912 0 1 189 183 135 5 0 184 2976 0 1 176 166 128 8 0 190 3072 0 0 252 252 189 3 0 197 3184 0 1 198 192 154 7 0 202 3264 0 1 100 96 80 4 0 211 3408 0 1 210 208 175 5 0 217 3504 0 1 98 94 84 6 0 222 3584 0 0 104 104 91 7 0 225 3632 0 1 54 50 48 8 0 254 4096 0 0 33591 33591 33591 1 0 Note, the huge size watermark is above 3632 and there are a number of new normal classes available that previously were merged with the huge class. For instance, size class torvalds#211 holds 210 objects of size 3408 and uses 175 physical pages, while previously for those objects we would have used 210 physical pages. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sergey Senozhatsky <[email protected]> Cc: Alexey Romanov <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Nitin Gupta <[email protected]> Signed-off-by: Andrew Morton <[email protected]>

Currently, test_progs outputs all stdout/stderr as it runs, and when it is done, prints a summary. It is non-trivial for tooling to parse that output and extract meaningful information from it. This change adds a new option, `--json-summary`/`-J` that let the caller specify a file where `test_progs{,-no_alu32}` can write a summary of the run in a json format that can later be parsed by tooling. Currently, it creates a summary section with successes/skipped/failures followed by a list of failed tests and subtests. A test contains the following fields: - name: the name of the test - number: the number of the test - message: the log message that was printed by the test. - failed: A boolean indicating whether the test failed or not. Currently we only output failed tests, but in the future, successful tests could be added. - subtests: A list of subtests associated with this test. A subtest contains the following fields: - name: same as above - number: sanme as above - message: the log message that was printed by the subtest. - failed: same as above but for the subtest An example run and json content below: ``` $ sudo ./test_progs -a $(grep -v '^#' ./DENYLIST.aarch64 | awk '{print $1","}' | tr -d '\n') -j -J /tmp/test_progs.json $ jq < /tmp/test_progs.json | head -n 30 { "success": 29, "success_subtest": 23, "skipped": 3, "failed": 28, "results": [ { "name": "bpf_cookie", "number": 10, "message": "test_bpf_cookie:PASS:skel_open 0 nsec\n", "failed": true, "subtests": [ { "name": "multi_kprobe_link_api", "number": 2, "message": "kprobe_multi_link_api_subtest:PASS:load_kallsyms 0 nsec\nlibbpf: extern 'bpf_testmod_fentry_test1' (strong): not resolved\nlibbpf: failed to load object 'kprobe_multi'\nlibbpf: failed to load BPF skeleton 'kprobe_multi': -3\nkprobe_multi_link_api_subtest:FAIL:fentry_raw_skel_load unexpected error: -3\n", "failed": true }, { "name": "multi_kprobe_attach_api", "number": 3, "message": "libbpf: extern 'bpf_testmod_fentry_test1' (strong): not resolved\nlibbpf: failed to load object 'kprobe_multi'\nlibbpf: failed to load BPF skeleton 'kprobe_multi': -3\nkprobe_multi_attach_api_subtest:FAIL:fentry_raw_skel_load unexpected error: -3\n", "failed": true }, { "name": "lsm", "number": 8, "message": "lsm_subtest:PASS:lsm.link_create 0 nsec\nlsm_subtest:FAIL:stack_mprotect unexpected stack_mprotect: actual 0 != expected -1\n", "failed": true } ``` The file can then be used to print a summary of the test run and list of failing tests/subtests: ``` $ jq -r < /tmp/test_progs.json '"Success: \(.success)/\(.success_subtest), Skipped: \(.skipped), Failed: \(.failed)"' Success: 29/23, Skipped: 3, Failed: 28 $ jq -r < /tmp/test_progs.json '.results | map([ if .failed then "#\(.number) \(.name)" else empty end, ( . as {name: $tname, number: $tnum} | .subtests | map( if .failed then "#\($tnum)/\(.number) \($tname)/\(.name)" else empty end ) ) ]) | flatten | .[]' | head -n 20 torvalds#10 bpf_cookie torvalds#10/2 bpf_cookie/multi_kprobe_link_api torvalds#10/3 bpf_cookie/multi_kprobe_attach_api torvalds#10/8 bpf_cookie/lsm torvalds#15 bpf_mod_race torvalds#15/1 bpf_mod_race/ksym (used_btfs UAF) torvalds#15/2 bpf_mod_race/kfunc (kfunc_btf_tab UAF) torvalds#36 cgroup_hierarchical_stats torvalds#61 deny_namespace torvalds#61/1 deny_namespace/unpriv_userns_create_no_bpf torvalds#73 fexit_stress torvalds#83 get_func_ip_test torvalds#99 kfunc_dynptr_param torvalds#99/1 kfunc_dynptr_param/dynptr_data_null torvalds#99/4 kfunc_dynptr_param/dynptr_data_null torvalds#100 kprobe_multi_bench_attach torvalds#100/1 kprobe_multi_bench_attach/kernel torvalds#100/2 kprobe_multi_bench_attach/modules torvalds#101 kprobe_multi_test torvalds#101/1 kprobe_multi_test/skel_api ``` Signed-off-by: Manu Bretelle <[email protected]>

Currently, test_progs outputs all stdout/stderr as it runs, and when it is done, prints a summary. It is non-trivial for tooling to parse that output and extract meaningful information from it. This change adds a new option, `--json-summary`/`-J` that let the caller specify a file where `test_progs{,-no_alu32}` can write a summary of the run in a json format that can later be parsed by tooling. Currently, it creates a summary section with successes/skipped/failures followed by a list of failed tests and subtests. A test contains the following fields: - name: the name of the test - number: the number of the test - message: the log message that was printed by the test. - failed: A boolean indicating whether the test failed or not. Currently we only output failed tests, but in the future, successful tests could be added. - subtests: A list of subtests associated with this test. A subtest contains the following fields: - name: same as above - number: sanme as above - message: the log message that was printed by the subtest. - failed: same as above but for the subtest An example run and json content below: ``` $ sudo ./test_progs -a $(grep -v '^#' ./DENYLIST.aarch64 | awk '{print $1","}' | tr -d '\n') -j -J /tmp/test_progs.json $ jq < /tmp/test_progs.json | head -n 30 { "success": 29, "success_subtest": 23, "skipped": 3, "failed": 28, "results": [ { "name": "bpf_cookie", "number": 10, "message": "test_bpf_cookie:PASS:skel_open 0 nsec\n", "failed": true, "subtests": [ { "name": "multi_kprobe_link_api", "number": 2, "message": "kprobe_multi_link_api_subtest:PASS:load_kallsyms 0 nsec\nlibbpf: extern 'bpf_testmod_fentry_test1' (strong): not resolved\nlibbpf: failed to load object 'kprobe_multi'\nlibbpf: failed to load BPF skeleton 'kprobe_multi': -3\nkprobe_multi_link_api_subtest:FAIL:fentry_raw_skel_load unexpected error: -3\n", "failed": true }, { "name": "multi_kprobe_attach_api", "number": 3, "message": "libbpf: extern 'bpf_testmod_fentry_test1' (strong): not resolved\nlibbpf: failed to load object 'kprobe_multi'\nlibbpf: failed to load BPF skeleton 'kprobe_multi': -3\nkprobe_multi_attach_api_subtest:FAIL:fentry_raw_skel_load unexpected error: -3\n", "failed": true }, { "name": "lsm", "number": 8, "message": "lsm_subtest:PASS:lsm.link_create 0 nsec\nlsm_subtest:FAIL:stack_mprotect unexpected stack_mprotect: actual 0 != expected -1\n", "failed": true } ``` The file can then be used to print a summary of the test run and list of failing tests/subtests: ``` $ jq -r < /tmp/test_progs.json '"Success: \(.success)/\(.success_subtest), Skipped: \(.skipped), Failed: \(.failed)"' Success: 29/23, Skipped: 3, Failed: 28 $ jq -r < /tmp/test_progs.json '.results | map([ if .failed then "#\(.number) \(.name)" else empty end, ( . as {name: $tname, number: $tnum} | .subtests | map( if .failed then "#\($tnum)/\(.number) \($tname)/\(.name)" else empty end ) ) ]) | flatten | .[]' | head -n 20 torvalds#10 bpf_cookie torvalds#10/2 bpf_cookie/multi_kprobe_link_api torvalds#10/3 bpf_cookie/multi_kprobe_attach_api torvalds#10/8 bpf_cookie/lsm torvalds#15 bpf_mod_race torvalds#15/1 bpf_mod_race/ksym (used_btfs UAF) torvalds#15/2 bpf_mod_race/kfunc (kfunc_btf_tab UAF) torvalds#36 cgroup_hierarchical_stats torvalds#61 deny_namespace torvalds#61/1 deny_namespace/unpriv_userns_create_no_bpf torvalds#73 fexit_stress torvalds#83 get_func_ip_test torvalds#99 kfunc_dynptr_param torvalds#99/1 kfunc_dynptr_param/dynptr_data_null torvalds#99/4 kfunc_dynptr_param/dynptr_data_null torvalds#100 kprobe_multi_bench_attach torvalds#100/1 kprobe_multi_bench_attach/kernel torvalds#100/2 kprobe_multi_bench_attach/modules torvalds#101 kprobe_multi_test torvalds#101/1 kprobe_multi_test/skel_api ``` Signed-off-by: Manu Bretelle <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]

Fix and update semantics for ops.enable() and ops.disable()

Merge pull request #1 from torvalds/master

02b5a60

pull from torvalds

mbin closed this Jun 18, 2014

hzhuang1 pushed a commit to hzhuang1/linux that referenced this pull request Aug 10, 2015

Merge pull request torvalds#100 from 96boards/working-syspll-get-from…

9212a52

…-registers-v3 Working syspll get from registers v3

iaguis pushed a commit to kinvolk/linux that referenced this pull request Feb 6, 2018

Merge pull request torvalds#100 from coreosbot/v4.13.8-coreos

dfb3337

Rebase patches onto 4.13.8

lluchs pushed a commit to KIT-OSGroup/linux that referenced this pull request May 10, 2022

Merge pull request torvalds#109 from hayley-leblanc/issue_100

b04a2c8

Fix crash consistency issue with alternate logs (torvalds#100)

logic10492 pushed a commit to logic10492/linux-amd-zen2 that referenced this pull request Jan 18, 2024

Merge pull request torvalds#100 from sched-ext/fix_enable

88568ae

Fix and update semantics for ops.enable() and ops.disable()

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Merge pull request #1 from torvalds/master #100

Merge pull request #1 from torvalds/master #100

mbin commented Jun 18, 2014

Merge pull request #1 from torvalds/master #100

Merge pull request #1 from torvalds/master #100

Conversation

mbin commented Jun 18, 2014