Merge pull request #1 from torvalds/master #100

Closed

mbin commented Jun 18, 2014

pull from torvalds

mbin closed this Jun 18, 2014
lkundrak pushed a commit to hackerspace/rpi-linux that referenced this pull request Jun 22, 2014
enable_irq_wake() might fail; if it does, we will see a kernel warning on
resume, because the resume path always calls disable_irq_wake().

  WARNING: at kernel/irq/manage.c:529 irq_set_irq_wake+0xc4/0xf0()
  Unbalanced IRQ 52 wake disable
  Modules linked in: ipv6 libcomposite configfs
  CPU: 0 PID: 1591 Comm: ash Tainted: G        W    3.10.0-00854-gdbd86d4-dirty #100
    (unwind_backtrace+0x0/0xf8) from (show_stack+0x10/0x14)
    (show_stack+0x10/0x14) from (warn_slowpath_common+0x54/0x68)
    (warn_slowpath_common+0x54/0x68) from (warn_slowpath_fmt+0x30/0x40)
    (warn_slowpath_fmt+0x30/0x40) from (irq_set_irq_wake+0xc4/0xf0)
    (irq_set_irq_wake+0xc4/0xf0) from (sirfsoc_rtc_restore+0x30/0x38)
    (sirfsoc_rtc_restore+0x30/0x38) from (platform_pm_restore+0x2c/0x50)
    (platform_pm_restore+0x2c/0x50) from (dpm_run_callback.clone.6+0x30/0xb0)
    (dpm_run_callback.clone.6+0x30/0xb0) from (device_resume+0x88/0x134)
    (device_resume+0x88/0x134) from (dpm_resume+0x114/0x230)
    (dpm_resume+0x114/0x230) from (hibernation_snapshot+0x178/0x1d0)
    (hibernation_snapshot+0x178/0x1d0) from (hibernate+0x130/0x1dc)
    (hibernate+0x130/0x1dc) from (state_store+0xb4/0xc0)
    (state_store+0xb4/0xc0) from (kobj_attr_store+0x14/0x20)
    (kobj_attr_store+0x14/0x20) from (sysfs_write_file+0xfc/0x17c)
    (sysfs_write_file+0xfc/0x17c) from (vfs_write+0xc8/0x194)
    (vfs_write+0xc8/0x194) from (SyS_write+0x40/0x6c)
    (SyS_write+0x40/0x6c) from (ret_fast_syscall+0x0/0x30)

To avoid the unbalanced "IRQ wake disable", ensure that disable_irq_wake() is
called only when enable_irq_wake() has succeeded.

Signed-off-by: Xianglong Du <[email protected]>
Signed-off-by: Barry Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
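
The pattern described above, calling the disable side only if the earlier enable succeeded, can be sketched in plain userspace C. This is a hedged illustration and not the sirfsoc RTC driver: `fake_enable_irq_wake()`, `fake_disable_irq_wake()`, `struct rtc_dev` and its `wake_enabled` flag are hypothetical stand-ins for the kernel API and the driver state.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's per-IRQ wake depth counter. */
static int wake_depth;
static bool enable_should_fail;

/* Mimics enable_irq_wake(): returns 0 on success, negative on failure. */
static int fake_enable_irq_wake(void)
{
	if (enable_should_fail)
		return -1;
	wake_depth++;
	return 0;
}

/* Mimics disable_irq_wake(): an unbalanced call is the warning case. */
static int fake_disable_irq_wake(void)
{
	if (wake_depth == 0)
		return -1;	/* "Unbalanced IRQ wake disable" */
	wake_depth--;
	return 0;
}

struct rtc_dev {
	bool wake_enabled;	/* remembers whether the enable succeeded */
};

static void rtc_suspend(struct rtc_dev *rtc)
{
	/* Record success so resume knows whether there is anything to undo. */
	rtc->wake_enabled = (fake_enable_irq_wake() == 0);
}

/* Returns 0 if resume stayed balanced, negative otherwise. */
static int rtc_resume(struct rtc_dev *rtc)
{
	int ret = 0;

	if (rtc->wake_enabled) {
		ret = fake_disable_irq_wake();
		rtc->wake_enabled = false;
	}
	return ret;
}
```

With `enable_should_fail` set, `rtc_suspend()` followed by `rtc_resume()` never reaches the disable call, so the depth counter stays balanced.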
lkundrak pushed a commit to hackerspace/rpi-linux that referenced this pull request Jun 22, 2014
Turn it into (for example):

[    0.073380] x86: Booting SMP configuration:
[    0.074005] .... node   #0, CPUs:          #1   #2   #3   #4   #5   #6   #7
[    0.603005] .... node   #1, CPUs:     #8   #9  #10  #11  #12  #13  #14  #15
[    1.200005] .... node   #2, CPUs:    #16  #17  #18  #19  #20  #21  #22  #23
[    1.796005] .... node   #3, CPUs:    #24  #25  #26  #27  #28  #29  #30  #31
[    2.393005] .... node   #4, CPUs:    #32  #33  #34  #35  #36  #37  #38  #39
[    2.996005] .... node   #5, CPUs:    #40  #41  #42  #43  #44  #45  #46  #47
[    3.600005] .... node   #6, CPUs:    #48  #49  #50  #51  #52  #53  #54  #55
[    4.202005] .... node   #7, CPUs:    #56  #57  #58  #59  #60  #61  #62  #63
[    4.811005] .... node   #8, CPUs:    #64  #65  #66  #67  #68  #69  #70  #71
[    5.421006] .... node   #9, CPUs:    #72  #73  #74  #75  #76  #77  #78  #79
[    6.032005] .... node  #10, CPUs:    #80  #81  #82  #83  #84  #85  #86  #87
[    6.648006] .... node  #11, CPUs:    #88  #89  #90  #91  #92  #93  #94  #95
[    7.262005] .... node  #12, CPUs:    #96  #97  #98  #99 #100 #101 #102 #103
[    7.865005] .... node  #13, CPUs:   #104 #105 #106 #107 #108 #109 #110 #111
[    8.466005] .... node  #14, CPUs:   #112 #113 #114 #115 #116 #117 #118 #119
[    9.073006] .... node  #15, CPUs:   #120 #121 #122 #123 #124 #125 #126 #127
[    9.679901] x86: Booted up 16 nodes, 128 CPUs

and drop useless elements.

Change num_digits() to hpa's division-avoiding, cell-phone-typed
version, which he went to great lengths and pains to submit on a
Saturday evening.

Signed-off-by: Borislav Petkov <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
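
A division-avoiding digit count of the kind the message mentions can be sketched as below. This is an illustration only: the thread does not show the kernel's num_digits() body, and the widening to `long long` is an assumption added here so the sketch cannot overflow.

```c
/*
 * Count the characters needed to print val in decimal (including a
 * leading '-'), using only comparison and multiplication, no division.
 */
static int num_digits(int val)
{
	long long v = val;	/* widened so that negating INT_MIN is safe */
	long long m = 10;	/* smallest value that needs d + 1 digits */
	int d = 1;

	if (v < 0) {
		d++;		/* room for the minus sign */
		v = -v;
	}

	while (v >= m) {
		m *= 10;
		d++;
	}
	return d;
}
```

A count like this is what lets each "#N" field in the boot log above be padded to a common width.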
hzhuang1 pushed a commit to hzhuang1/linux that referenced this pull request Aug 10, 2015
…-registers-v3

Working syspll get from registers v3
sashalevin pushed a commit to sashalevin/linux-stable-security that referenced this pull request Apr 29, 2016
commit 66efdc7 upstream.

snd_seq_timer_open() didn't catch the whole error path: it let the
error through if the timer id is a slave.  This may lead to an Oops by
accessing an uninitialized pointer.

 BUG: unable to handle kernel NULL pointer dereference at 00000000000002ae
 IP: [<ffffffff819b3477>] snd_seq_timer_open+0xe7/0x130
 PGD 785cd067 PUD 76964067 PMD 0
 Oops: 0002 [#4] SMP
 CPU 0
 Pid: 4288, comm: trinity-child7 Tainted: G      D W 3.9.0-rc1+ #100 Bochs Bochs
 RIP: 0010:[<ffffffff819b3477>]  [<ffffffff819b3477>] snd_seq_timer_open+0xe7/0x130
 RSP: 0018:ffff88006ece7d38  EFLAGS: 00010246
 RAX: 0000000000000286 RBX: ffff88007851b400 RCX: 0000000000000000
 RDX: 000000000000ffff RSI: ffff88006ece7d58 RDI: ffff88006ece7d38
 RBP: ffff88006ece7d98 R08: 000000000000000a R09: 000000000000fffe
 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
 R13: ffff8800792c5400 R14: 0000000000e8f000 R15: 0000000000000007
 FS:  00007f7aaa650700(0000) GS:ffff88007f800000(0000) GS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00000000000002ae CR3: 000000006efec000 CR4: 00000000000006f0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
 Process trinity-child7 (pid: 4288, threadinfo ffff88006ece6000, task ffff880076a8a290)
 Stack:
  0000000000000286 ffffffff828f2be0 ffff88006ece7d58 ffffffff810f354d
  65636e6575716573 2065756575712072 ffff8800792c0030 0000000000000000
  ffff88006ece7d98 ffff8800792c5400 ffff88007851b400 ffff8800792c5520
 Call Trace:
  [<ffffffff810f354d>] ? trace_hardirqs_on+0xd/0x10
  [<ffffffff819b17e9>] snd_seq_queue_timer_open+0x29/0x70
  [<ffffffff819ae01a>] snd_seq_ioctl_set_queue_timer+0xda/0x120
  [<ffffffff819acb9b>] snd_seq_do_ioctl+0x9b/0xd0
  [<ffffffff819acbe0>] snd_seq_ioctl+0x10/0x20
  [<ffffffff811b9542>] do_vfs_ioctl+0x522/0x570
  [<ffffffff8130a4b3>] ? file_has_perm+0x83/0xa0
  [<ffffffff810f354d>] ? trace_hardirqs_on+0xd/0x10
  [<ffffffff811b95ed>] sys_ioctl+0x5d/0xa0
  [<ffffffff813663fe>] ? trace_hardirqs_on_thunk+0x3a/0x3f
  [<ffffffff81faed69>] system_call_fastpath+0x16/0x1b

Reported-and-tested-by: Tommi Rantala <[email protected]>
Signed-off-by: Takashi Iwai <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

Signed-off-by: Sasha Levin <[email protected]>
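
The class of bug described above, an error return that is checked on only some branches so a never-assigned pointer gets dereferenced, can be sketched in userspace C. Everything here is hypothetical (`timer_open()`, `queue_timer_open()`, the `fail` switch); it is not the ALSA sequencer code.

```c
#include <stddef.h>

enum timer_class { TIMER_GLOBAL, TIMER_SLAVE };

struct timer {
	int running;
};

static struct timer the_timer;

/* Hypothetical opener: on failure it returns -1 and leaves *out untouched. */
static int timer_open(struct timer **out, enum timer_class cls, int fail)
{
	(void)cls;
	if (fail)
		return -1;
	*out = &the_timer;
	return 0;
}

static int queue_timer_open(enum timer_class cls, int fail)
{
	struct timer *t = NULL;
	int err = timer_open(&t, cls, fail);

	/*
	 * A check such as "if (err < 0 && cls != TIMER_SLAVE) return err;"
	 * would let a failed slave open fall through to the dereference
	 * below. Checking err unconditionally covers the whole error path.
	 */
	if (err < 0)
		return err;

	t->running = 1;
	return 0;
}
```

Calling `queue_timer_open(TIMER_SLAVE, 1)` returns the error instead of touching `t`.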
sashalevin pushed a commit to sashalevin/linux-stable-security that referenced this pull request Apr 29, 2016
commit 66efdc7 upstream.

snd_seq_timer_open() didn't catch the whole error path: it let the
error through if the timer id is a slave.  This may lead to an Oops by
accessing an uninitialized pointer.

 BUG: unable to handle kernel NULL pointer dereference at 00000000000002ae
 IP: [<ffffffff819b3477>] snd_seq_timer_open+0xe7/0x130
 PGD 785cd067 PUD 76964067 PMD 0
 Oops: 0002 [#4] SMP
 CPU 0
 Pid: 4288, comm: trinity-child7 Tainted: G      D W 3.9.0-rc1+ #100 Bochs Bochs
 RIP: 0010:[<ffffffff819b3477>]  [<ffffffff819b3477>] snd_seq_timer_open+0xe7/0x130
 RSP: 0018:ffff88006ece7d38  EFLAGS: 00010246
 RAX: 0000000000000286 RBX: ffff88007851b400 RCX: 0000000000000000
 RDX: 000000000000ffff RSI: ffff88006ece7d58 RDI: ffff88006ece7d38
 RBP: ffff88006ece7d98 R08: 000000000000000a R09: 000000000000fffe
 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
 R13: ffff8800792c5400 R14: 0000000000e8f000 R15: 0000000000000007
 FS:  00007f7aaa650700(0000) GS:ffff88007f800000(0000) GS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00000000000002ae CR3: 000000006efec000 CR4: 00000000000006f0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
 Process trinity-child7 (pid: 4288, threadinfo ffff88006ece6000, task ffff880076a8a290)
 Stack:
  0000000000000286 ffffffff828f2be0 ffff88006ece7d58 ffffffff810f354d
  65636e6575716573 2065756575712072 ffff8800792c0030 0000000000000000
  ffff88006ece7d98 ffff8800792c5400 ffff88007851b400 ffff8800792c5520
 Call Trace:
  [<ffffffff810f354d>] ? trace_hardirqs_on+0xd/0x10
  [<ffffffff819b17e9>] snd_seq_queue_timer_open+0x29/0x70
  [<ffffffff819ae01a>] snd_seq_ioctl_set_queue_timer+0xda/0x120
  [<ffffffff819acb9b>] snd_seq_do_ioctl+0x9b/0xd0
  [<ffffffff819acbe0>] snd_seq_ioctl+0x10/0x20
  [<ffffffff811b9542>] do_vfs_ioctl+0x522/0x570
  [<ffffffff8130a4b3>] ? file_has_perm+0x83/0xa0
  [<ffffffff810f354d>] ? trace_hardirqs_on+0xd/0x10
  [<ffffffff811b95ed>] sys_ioctl+0x5d/0xa0
  [<ffffffff813663fe>] ? trace_hardirqs_on_thunk+0x3a/0x3f
  [<ffffffff81faed69>] system_call_fastpath+0x16/0x1b

Reported-and-tested-by: Tommi Rantala <[email protected]>
Signed-off-by: Takashi Iwai <[email protected]>
Signed-off-by: Ben Hutchings <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
0day-ci pushed a commit to 0day-ci/linux that referenced this pull request Aug 26, 2016
…eckpatch-fixes

WARNING: do not add new typedefs
#86: FILE: arch/powerpc/include/asm/elf_util.h:35:
+typedef unsigned long func_desc_t;

WARNING: do not add new typedefs
#90: FILE: arch/powerpc/include/asm/elf_util.h:39:
+typedef struct ppc64_opd_entry func_desc_t;

WARNING: Block comments use * on subsequent lines
#94: FILE: arch/powerpc/include/asm/elf_util.h:43:
+/* Like PPC32, we need little trampolines to do > 24-bit jumps (into
+   the kernel itself).  But on PPC64, these need to be used for every

WARNING: Block comments use a trailing */ on a separate line
#95: FILE: arch/powerpc/include/asm/elf_util.h:44:
+   jump, actually, to reset r2 (TOC+0x8000). */

ERROR: open brace '{' following struct go on the same line
#97: FILE: arch/powerpc/include/asm/elf_util.h:46:
+struct ppc64_stub_entry
+{

WARNING: Block comments use a trailing */ on a separate line
#100: FILE: arch/powerpc/include/asm/elf_util.h:49:
+	 * so we don't have to modify the trampoline load instruction. */

WARNING: Block comments use * on subsequent lines
#110: FILE: arch/powerpc/include/asm/elf_util.h:59:
+/* r2 is the TOC pointer: it actually points 0x8000 into the TOC (this
+   gives the value maximum span in an instruction which uses a signed

WARNING: Block comments use a trailing */ on a separate line
#111: FILE: arch/powerpc/include/asm/elf_util.h:60:
+   offset) */

WARNING: Block comments use * on subsequent lines
#132: FILE: arch/powerpc/include/asm/module.h:18:
+/* Both low and high 16 bits are added as SIGNED additions, so if low
+   16 bits has high bit set, high 16 bits must be adjusted.  These

WARNING: Block comments use a trailing */ on a separate line
#133: FILE: arch/powerpc/include/asm/module.h:19:
+   macros do that (stolen from binutils). */

WARNING: space prohibited between function name and open parenthesis '('
#136: FILE: arch/powerpc/include/asm/module.h:22:
+#define PPC_HA(v) PPC_HI ((v) + 0x8000)

ERROR: Macros with complex values should be enclosed in parentheses
#136: FILE: arch/powerpc/include/asm/module.h:22:
+#define PPC_HA(v) PPC_HI ((v) + 0x8000)

WARNING: please, no spaces at the start of a line
#210: FILE: arch/powerpc/kernel/elf_util_64.c:32:
+ (((1 << (((other) & STO_PPC64_LOCAL_MASK) >> STO_PPC64_LOCAL_BIT)) >> 2) << 2)$

WARNING: Block comments use a trailing */ on a separate line
#216: FILE: arch/powerpc/kernel/elf_util_64.c:38:
+	 * of function and try to derive r2 from it). */

WARNING: line over 80 characters
#357: FILE: arch/powerpc/kernel/elf_util_64.c:179:
+				value = stub_for_addr(elf_info, value, obj_name);

WARNING: line over 80 characters
#363: FILE: arch/powerpc/kernel/elf_util_64.c:185:
+				squash_toc_save_inst(strtab + sym->st_name, value);

ERROR: space required before the open brace '{'
#369: FILE: arch/powerpc/kernel/elf_util_64.c:191:
+			if (value + 0x2000000 > 0x3ffffff || (value & 3) != 0){

WARNING: line over 80 characters
#560: FILE: arch/powerpc/kernel/module_64.c:341:
+	sechdrs[me->arch.elf_info.stubs_section].sh_size = get_stubs_size(hdr, sechdrs);

WARNING: line over 80 characters
#613: FILE: arch/powerpc/kernel/module_64.c:380:
+	struct elf_shdr *stubs_sec = &elf_info->sechdrs[elf_info->stubs_section];

WARNING: line over 80 characters
#889: FILE: arch/powerpc/kernel/module_64.c:498:
+	num_stubs = sechdrs[me->arch.elf_info.stubs_section].sh_size / sizeof(*entry);

total: 3 errors, 17 warnings, 830 lines checked

NOTE: For some of the reported defects, checkpatch may be able to
      mechanically convert to the typical style using --fix or --fix-inplace.

./patches/powerpc-factor-out-relocation-code-from-module_64c-to-elf_util_64c.patch has style problems, please review.

NOTE: If any of the errors are false positives, please report
      them to the maintainer, see CHECKPATCH in MAINTAINERS.

Please run checkpatch prior to sending patches

Cc: Thiago Jung Bauermann <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
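
The PPC_HA() complaints above (space before the parenthesis, unparenthesized macro body) and the quoted comment about SIGNED additions can be illustrated together. A minimal sketch, assuming the usual binutils-style definitions of `PPC_LO` and `PPC_HI`, which are not quoted in this log; `rebuild()` is a hypothetical helper added only to check the arithmetic.

```c
#include <stdint.h>

/* PPC_LO and PPC_HI are assumed definitions; only PPC_HA appears in the log. */
#define PPC_LO(v) ((v) & 0xffff)
#define PPC_HI(v) (((v) >> 16) & 0xffff)
/* No space before '(' and a fully parenthesized body, as checkpatch asks. */
#define PPC_HA(v) (PPC_HI((v) + 0x8000))

/* Rebuild a 32-bit value from its high-adjusted half and signed low half. */
static uint32_t rebuild(uint32_t v)
{
	uint32_t high = (uint32_t)PPC_HA(v) << 16;
	int16_t low = (int16_t)PPC_LO(v);	/* low half is added as signed */

	return high + (uint32_t)(int32_t)low;
}
```

When bit 15 of the low half is set, the signed low half is negative, so the high half has to be one larger than the plain `PPC_HI()` value for the sum to come out right; that is the adjustment the `+ 0x8000` performs.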
koct9i pushed a commit to koct9i/linux that referenced this pull request Aug 27, 2016
…eckpatch-fixes

WARNING: do not add new typedefs
#86: FILE: arch/powerpc/include/asm/elf_util.h:35:
+typedef unsigned long func_desc_t;

WARNING: do not add new typedefs
#90: FILE: arch/powerpc/include/asm/elf_util.h:39:
+typedef struct ppc64_opd_entry func_desc_t;

WARNING: Block comments use * on subsequent lines
#94: FILE: arch/powerpc/include/asm/elf_util.h:43:
+/* Like PPC32, we need little trampolines to do > 24-bit jumps (into
+   the kernel itself).  But on PPC64, these need to be used for every

WARNING: Block comments use a trailing */ on a separate line
#95: FILE: arch/powerpc/include/asm/elf_util.h:44:
+   jump, actually, to reset r2 (TOC+0x8000). */

ERROR: open brace '{' following struct go on the same line
#97: FILE: arch/powerpc/include/asm/elf_util.h:46:
+struct ppc64_stub_entry
+{

WARNING: Block comments use a trailing */ on a separate line
#100: FILE: arch/powerpc/include/asm/elf_util.h:49:
+	 * so we don't have to modify the trampoline load instruction. */

WARNING: Block comments use * on subsequent lines
#110: FILE: arch/powerpc/include/asm/elf_util.h:59:
+/* r2 is the TOC pointer: it actually points 0x8000 into the TOC (this
+   gives the value maximum span in an instruction which uses a signed

WARNING: Block comments use a trailing */ on a separate line
#111: FILE: arch/powerpc/include/asm/elf_util.h:60:
+   offset) */

WARNING: Block comments use * on subsequent lines
#132: FILE: arch/powerpc/include/asm/module.h:18:
+/* Both low and high 16 bits are added as SIGNED additions, so if low
+   16 bits has high bit set, high 16 bits must be adjusted.  These

WARNING: Block comments use a trailing */ on a separate line
#133: FILE: arch/powerpc/include/asm/module.h:19:
+   macros do that (stolen from binutils). */

WARNING: space prohibited between function name and open parenthesis '('
#136: FILE: arch/powerpc/include/asm/module.h:22:
+#define PPC_HA(v) PPC_HI ((v) + 0x8000)

ERROR: Macros with complex values should be enclosed in parentheses
#136: FILE: arch/powerpc/include/asm/module.h:22:
+#define PPC_HA(v) PPC_HI ((v) + 0x8000)

WARNING: please, no spaces at the start of a line
#210: FILE: arch/powerpc/kernel/elf_util_64.c:32:
+ (((1 << (((other) & STO_PPC64_LOCAL_MASK) >> STO_PPC64_LOCAL_BIT)) >> 2) << 2)$

WARNING: Block comments use a trailing */ on a separate line
#216: FILE: arch/powerpc/kernel/elf_util_64.c:38:
+	 * of function and try to derive r2 from it). */

WARNING: line over 80 characters
#357: FILE: arch/powerpc/kernel/elf_util_64.c:179:
+				value = stub_for_addr(elf_info, value, obj_name);

WARNING: line over 80 characters
#363: FILE: arch/powerpc/kernel/elf_util_64.c:185:
+				squash_toc_save_inst(strtab + sym->st_name, value);

ERROR: space required before the open brace '{'
#369: FILE: arch/powerpc/kernel/elf_util_64.c:191:
+			if (value + 0x2000000 > 0x3ffffff || (value & 3) != 0){

WARNING: line over 80 characters
#560: FILE: arch/powerpc/kernel/module_64.c:341:
+	sechdrs[me->arch.elf_info.stubs_section].sh_size = get_stubs_size(hdr, sechdrs);

WARNING: line over 80 characters
#613: FILE: arch/powerpc/kernel/module_64.c:380:
+	struct elf_shdr *stubs_sec = &elf_info->sechdrs[elf_info->stubs_section];

WARNING: line over 80 characters
#889: FILE: arch/powerpc/kernel/module_64.c:498:
+	num_stubs = sechdrs[me->arch.elf_info.stubs_section].sh_size / sizeof(*entry);

total: 3 errors, 17 warnings, 830 lines checked

NOTE: For some of the reported defects, checkpatch may be able to
      mechanically convert to the typical style using --fix or --fix-inplace.

./patches/powerpc-factor-out-relocation-code-from-module_64c-to-elf_util_64c.patch has style problems, please review.

NOTE: If any of the errors are false positives, please report
      them to the maintainer, see CHECKPATCH in MAINTAINERS.

Please run checkpatch prior to sending patches

Cc: Thiago Jung Bauermann <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
fengguang pushed a commit to 0day-ci/linux that referenced this pull request Apr 7, 2017
On a ppc64 machine, executing overlayfs/019 with xfs as the lower and
upper filesystem causes the following call trace:

WARNING: CPU: 2 PID: 8034 at /root/repos/linux/fs/iomap.c:765 .iomap_dio_actor+0xcc/0x420
Modules linked in:
CPU: 2 PID: 8034 Comm: fsstress Tainted: G             L  4.11.0-rc5-next-20170405 #100
task: c000000631314880 task.stack: c0000003915d4000
NIP: c00000000035a72c LR: c00000000035a6f4 CTR: c00000000035a660
REGS: c0000003915d7570 TRAP: 0700   Tainted: G             L   (4.11.0-rc5-next-20170405)
MSR: 800000000282b032 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI>
  CR: 24004284  XER: 00000000
CFAR: c0000000006f7190 SOFTE: 1
GPR00: c00000000035a6f4 c0000003915d77f0 c0000000015a3f00 000000007c22f600
GPR04: 000000000022d000 0000000000002600 c0000003b2d56360 c0000003915d7960
GPR08: c0000003915d7cd0 0000000000000002 0000000000002600 c000000000521cc0
GPR12: 0000000024004284 c00000000fd80a00 000000004b04ae64 ffffffffffffffff
GPR16: 000000001000ca70 0000000000000000 c0000003b2d56380 c00000000153d2b8
GPR20: 0000000000000010 c0000003bc87bac8 0000000000223000 000000000022f5ff
GPR24: c0000003b2d56360 000000000000000c 0000000000002600 000000000022d000
GPR28: 0000000000000000 c0000003915d7960 c0000003b2d56360 00000000000001ff
NIP [c00000000035a72c] .iomap_dio_actor+0xcc/0x420
LR [c00000000035a6f4] .iomap_dio_actor+0x94/0x420
Call Trace:
[c0000003915d77f0] [c00000000035a6f4] .iomap_dio_actor+0x94/0x420 (unreliable)
[c0000003915d78f0] [c00000000035b9f4] .iomap_apply+0xf4/0x1f0
[c0000003915d79d0] [c00000000035c320] .iomap_dio_rw+0x230/0x420
[c0000003915d7ae0] [c000000000512a14] .xfs_file_dio_aio_read+0x84/0x160
[c0000003915d7b80] [c000000000512d24] .xfs_file_read_iter+0x104/0x130
[c0000003915d7c10] [c0000000002d6234] .__vfs_read+0x114/0x1a0
[c0000003915d7cf0] [c0000000002d7a8c] .vfs_read+0xac/0x1a0
[c0000003915d7d90] [c0000000002d96b8] .SyS_read+0x58/0x100
[c0000003915d7e30] [c00000000000b8e0] system_call+0x38/0xfc
Instruction dump:
78630020 7f831b78 7ffc07b4 7c7ce039 40820360 a13d0018 2f890003 419e0288
2f890004 419e00a0 2f890001 419e02a8 <0fe00000> 3b80fffb 38210100 7f83e378

The above problem can also be recreated on a regular xfs filesystem
using the command:

$ fsstress -d /mnt -l 1000 -n 1000 -p 1000

The reason for the call trace is:
1. When 'reserving' blocks for delayed allocation, XFS reserves more
   blocks (i.e. past file's current EOF) than required. This is done
   because XFS assumes that userspace might write more data and hence
   'reserving' more blocks might lead to the file's new data being
   stored contiguously on disk.
2. The in-memory 'struct xfs_bmbt_irec' mapping the file's last extent would
   then cover the prealloc-ed EOF blocks in addition to the regular blocks.
3. When flushing the dirty blocks to disk, we only flush data till the
   file's EOF. But before writing out the dirty data, we allocate blocks
   on the disk for holding the file's new data. This allocation includes
   the blocks that are part of the 'prealloc EOF blocks'.
4. Later, when the last reference to the inode is being closed, XFS frees the
   unused 'prealloc EOF blocks' in xfs_inactive().

In step 3 above, when allocating space on disk for the delayed allocation
range, the space allocator might sometimes allocate fewer blocks than
required. If such an allocation ends right at the current EOF of the
file, we will not be able to clear the "delayed allocation" flag for the
'prealloc EOF blocks', since we won't have dirty buffer heads associated
with that range of the file.

In such a situation, if a Direct I/O read operation is performed on file
range [X, Y] (where X < EOF and Y > EOF), we flush dirty data in the
range [X, Y] and invalidate the page cache for that range (refer to
iomap_dio_rw()). Later, to perform the Direct I/O read, XFS obtains
the extent items (which are still cached in memory) for the file
range. When doing so, we are not supposed to get an extent item with the
IOMAP_DELALLOC flag set, since the previous "flush" operation should
have converted any delayed allocation data in the range [X, Y]. Hence we
end up hitting a WARN_ON_ONCE(1) statement in iomap_dio_actor().

This commit fixes the bug by preventing the read operation from going
beyond iomap_dio->i_size.

Signed-off-by: Chandan Rajendra <[email protected]>
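
The idea of keeping a read from going beyond i_size can be sketched as a simple clamp. This is an illustration under stated assumptions and not the actual patch: `clamp_read_to_isize()` and its parameters are hypothetical names, and offsets are assumed to be non-negative.

```c
#include <stddef.h>

/*
 * Return how many bytes of a read starting at pos may proceed when the
 * read must not extend past i_size (the file size sampled for the I/O).
 */
static size_t clamp_read_to_isize(long long pos, size_t count, long long i_size)
{
	if (pos >= i_size)
		return 0;			/* entirely beyond EOF */
	if ((unsigned long long)(i_size - pos) < count)
		return (size_t)(i_size - pos);	/* trim the tail past EOF */
	return count;
}
```

With the range trimmed this way, a read over [X, Y] with Y > EOF never asks for the extent that still covers the 'prealloc EOF blocks'.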
fengguang pushed a commit to 0day-ci/linux that referenced this pull request Apr 8, 2017
On a ppc64 machine, executing overlayfs/019 with xfs as the lower and
upper filesystem causes the following call trace:

WARNING: CPU: 2 PID: 8034 at /root/repos/linux/fs/iomap.c:765 .iomap_dio_actor+0xcc/0x420
Modules linked in:
CPU: 2 PID: 8034 Comm: fsstress Tainted: G             L  4.11.0-rc5-next-20170405 #100
task: c000000631314880 task.stack: c0000003915d4000
NIP: c00000000035a72c LR: c00000000035a6f4 CTR: c00000000035a660
REGS: c0000003915d7570 TRAP: 0700   Tainted: G             L   (4.11.0-rc5-next-20170405)
MSR: 800000000282b032 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI>
  CR: 24004284  XER: 00000000
CFAR: c0000000006f7190 SOFTE: 1
GPR00: c00000000035a6f4 c0000003915d77f0 c0000000015a3f00 000000007c22f600
GPR04: 000000000022d000 0000000000002600 c0000003b2d56360 c0000003915d7960
GPR08: c0000003915d7cd0 0000000000000002 0000000000002600 c000000000521cc0
GPR12: 0000000024004284 c00000000fd80a00 000000004b04ae64 ffffffffffffffff
GPR16: 000000001000ca70 0000000000000000 c0000003b2d56380 c00000000153d2b8
GPR20: 0000000000000010 c0000003bc87bac8 0000000000223000 000000000022f5ff
GPR24: c0000003b2d56360 000000000000000c 0000000000002600 000000000022d000
GPR28: 0000000000000000 c0000003915d7960 c0000003b2d56360 00000000000001ff
NIP [c00000000035a72c] .iomap_dio_actor+0xcc/0x420
LR [c00000000035a6f4] .iomap_dio_actor+0x94/0x420
Call Trace:
[c0000003915d77f0] [c00000000035a6f4] .iomap_dio_actor+0x94/0x420 (unreliable)
[c0000003915d78f0] [c00000000035b9f4] .iomap_apply+0xf4/0x1f0
[c0000003915d79d0] [c00000000035c320] .iomap_dio_rw+0x230/0x420
[c0000003915d7ae0] [c000000000512a14] .xfs_file_dio_aio_read+0x84/0x160
[c0000003915d7b80] [c000000000512d24] .xfs_file_read_iter+0x104/0x130
[c0000003915d7c10] [c0000000002d6234] .__vfs_read+0x114/0x1a0
[c0000003915d7cf0] [c0000000002d7a8c] .vfs_read+0xac/0x1a0
[c0000003915d7d90] [c0000000002d96b8] .SyS_read+0x58/0x100
[c0000003915d7e30] [c00000000000b8e0] system_call+0x38/0xfc
Instruction dump:
78630020 7f831b78 7ffc07b4 7c7ce039 40820360 a13d0018 2f890003 419e0288
2f890004 419e00a0 2f890001 419e02a8 <0fe00000> 3b80fffb 38210100 7f83e378

The above problem can also be recreated on a regular xfs filesystem
using the command:

$ fsstress -d /mnt -l 1000 -n 1000 -p 1000

The reason for the call trace is:
1. When 'reserving' blocks for delayed allocation, XFS reserves more
   blocks (i.e. past file's current EOF) than required. This is done
   because XFS assumes that userspace might write more data and hence
   'reserving' more blocks might lead to the file's new data being
   stored contiguously on disk.
2. The in-memory 'struct xfs_bmbt_irec' mapping the file's last extent would
   then cover the prealloc-ed EOF blocks in addition to the regular blocks.
3. When flushing the dirty blocks to disk, we only flush data till the
   file's EOF. But before writing out the dirty data, we allocate blocks
   on the disk for holding the file's new data. This allocation includes
   the blocks that are part of the 'prealloc EOF blocks'.
4. Later, when the last reference to the inode is being closed, XFS frees the
   unused 'prealloc EOF blocks' in xfs_inactive().

In step 3 above, when allocating space on disk for the delayed allocation
range, the space allocator might sometimes allocate fewer blocks than
required. If such an allocation ends right at the current EOF of the
file, we will not be able to clear the "delayed allocation" flag for the
'prealloc EOF blocks', since we won't have dirty buffer heads associated
with that range of the file.

In such a situation, if a Direct I/O read operation is performed on file
range [X, Y] (where X < EOF and Y > EOF), we flush dirty data in the
range [X, Y] and invalidate the page cache for that range (refer to
iomap_dio_rw()). Later, to perform the Direct I/O read, XFS obtains
the extent items (which are still cached in memory) for the file
range. When doing so we are not supposed to get an extent item with
the IOMAP_DELALLOC flag set, since the previous "flush" operation should
have converted any delayed allocation data in the range [X, Y]. Because
the 'prealloc EOF blocks' are still marked as delayed allocation, we
end up hitting a WARN_ON_ONCE(1) statement in iomap_dio_actor().

This commit fixes the bug by preventing the read operation from going
beyond iomap_dio->i_size.
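
The idea of not reading past i_size can be sketched in plain C as below.
This is only an illustration of the general technique, not the code of
the patch: the helper name clamp_dio_read() and its parameters are
hypothetical, and the real change lives inside the kernel's iomap code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not kernel code): clamp a direct I/O read request
 * [pos, pos + count) so that it never extends past the file size.
 * Returns the number of bytes that may be read, or 0 if pos is at or
 * beyond EOF.
 */
static size_t clamp_dio_read(int64_t pos, size_t count, int64_t i_size)
{
	assert(pos >= 0 && i_size >= 0);

	if (pos >= i_size)
		return 0;	/* nothing to read at or beyond EOF */

	uint64_t remaining = (uint64_t)(i_size - pos);

	return count < remaining ? count : (size_t)remaining;
}
```

With a 50-byte file, a request for 100 bytes at offset 40 would be cut
down to 10 bytes, so the extent lookup never reaches the blocks past EOF.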

Reported-by: Santhosh G <[email protected]>
Signed-off-by: Chandan Rajendra <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
djwong pushed a commit to djwong/linux that referenced this pull request Apr 25, 2017
On a ppc64 machine, executing overlayfs/019 with xfs as the lower and
upper filesystem causes the following call trace,

WARNING: CPU: 2 PID: 8034 at /root/repos/linux/fs/iomap.c:765 .iomap_dio_actor+0xcc/0x420
Modules linked in:
CPU: 2 PID: 8034 Comm: fsstress Tainted: G             L  4.11.0-rc5-next-20170405 #100
task: c000000631314880 task.stack: c0000003915d4000
NIP: c00000000035a72c LR: c00000000035a6f4 CTR: c00000000035a660
REGS: c0000003915d7570 TRAP: 0700   Tainted: G             L   (4.11.0-rc5-next-20170405)
MSR: 800000000282b032 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI>
  CR: 24004284  XER: 00000000
CFAR: c0000000006f7190 SOFTE: 1
GPR00: c00000000035a6f4 c0000003915d77f0 c0000000015a3f00 000000007c22f600
GPR04: 000000000022d000 0000000000002600 c0000003b2d56360 c0000003915d7960
GPR08: c0000003915d7cd0 0000000000000002 0000000000002600 c000000000521cc0
GPR12: 0000000024004284 c00000000fd80a00 000000004b04ae64 ffffffffffffffff
GPR16: 000000001000ca70 0000000000000000 c0000003b2d56380 c00000000153d2b8
GPR20: 0000000000000010 c0000003bc87bac8 0000000000223000 000000000022f5ff
GPR24: c0000003b2d56360 000000000000000c 0000000000002600 000000000022d000
GPR28: 0000000000000000 c0000003915d7960 c0000003b2d56360 00000000000001ff
NIP [c00000000035a72c] .iomap_dio_actor+0xcc/0x420
LR [c00000000035a6f4] .iomap_dio_actor+0x94/0x420
Call Trace:
[c0000003915d77f0] [c00000000035a6f4] .iomap_dio_actor+0x94/0x420 (unreliable)
[c0000003915d78f0] [c00000000035b9f4] .iomap_apply+0xf4/0x1f0
[c0000003915d79d0] [c00000000035c320] .iomap_dio_rw+0x230/0x420
[c0000003915d7ae0] [c000000000512a14] .xfs_file_dio_aio_read+0x84/0x160
[c0000003915d7b80] [c000000000512d24] .xfs_file_read_iter+0x104/0x130
[c0000003915d7c10] [c0000000002d6234] .__vfs_read+0x114/0x1a0
[c0000003915d7cf0] [c0000000002d7a8c] .vfs_read+0xac/0x1a0
[c0000003915d7d90] [c0000000002d96b8] .SyS_read+0x58/0x100
[c0000003915d7e30] [c00000000000b8e0] system_call+0x38/0xfc
Instruction dump:
78630020 7f831b78 7ffc07b4 7c7ce039 40820360 a13d0018 2f890003 419e0288
2f890004 419e00a0 2f890001 419e02a8 <0fe00000> 3b80fffb 38210100 7f83e378

The above problem can also be recreated on a regular xfs filesystem
using the command,

$ fsstress -d /mnt -l 1000 -n 1000 -p 1000

The reason for the call trace is,
1. When 'reserving' blocks for delayed allocation, XFS reserves more
   blocks (i.e. past the file's current EOF) than required. This is done
   because XFS assumes that userspace might write more data and hence
   'reserving' more blocks might lead to the file's new data being
   stored contiguously on disk.
2. The in-memory 'struct xfs_bmbt_irec' mapping the file's last extent would
   then cover the prealloc-ed EOF blocks in addition to the regular blocks.
3. When flushing the dirty blocks to disk, we only flush data till the
   file's EOF. But before writing out the dirty data, we allocate blocks
   on the disk for holding the file's new data. This allocation includes
   the blocks that are part of the 'prealloc EOF blocks'.
4. Later, when the last reference to the inode is being closed, XFS frees the
   unused 'prealloc EOF blocks' in xfs_inactive().

In step 3 above, when allocating space on disk for the delayed allocation
range, the space allocator might sometimes allocate fewer blocks than
required. If such an allocation ends right at the current EOF of the
file, we will not be able to clear the "delayed allocation" flag for the
'prealloc EOF blocks', since we won't have dirty buffer heads associated
with that range of the file.
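
The short allocation described above can be modelled with a toy extent
list. This is a simplified sketch and not XFS code: struct toy_extent and
both helpers are hypothetical names, and real extents carry far more
state than a start, a length and a flag.

```c
#include <stdbool.h>
#include <stddef.h>

enum ext_state { EXT_DELALLOC, EXT_ALLOCATED };

struct toy_extent {
	unsigned long start;	/* first block of the extent */
	unsigned long len;	/* number of blocks */
	enum ext_state state;
};

/*
 * Convert the first 'got' blocks of a delalloc extent to allocated and
 * store whatever is left over in 'rest'. Returns true if a delalloc
 * remainder is left behind (the short-allocation case).
 */
static bool toy_convert(struct toy_extent *ext, unsigned long got,
			struct toy_extent *rest)
{
	if (got >= ext->len) {
		ext->state = EXT_ALLOCATED;
		rest->start = ext->start + ext->len;
		rest->len = 0;
		rest->state = EXT_ALLOCATED;
		return false;
	}
	rest->start = ext->start + got;
	rest->len = ext->len - got;
	rest->state = EXT_DELALLOC;	/* e.g. the 'prealloc EOF blocks' */
	ext->len = got;
	ext->state = EXT_ALLOCATED;
	return true;
}

/* Does the block range [first, last] overlap a still-delalloc extent? */
static bool toy_range_hits_delalloc(const struct toy_extent *exts, size_t n,
				    unsigned long first, unsigned long last)
{
	for (size_t i = 0; i < n; i++) {
		const struct toy_extent *e = &exts[i];

		if (e->len == 0 || e->state != EXT_DELALLOC)
			continue;
		if (e->start <= last && first < e->start + e->len)
			return true;
	}
	return false;
}
```

If a 12-block delalloc extent covers 8 blocks of data plus 4 preallocated
blocks and only 8 blocks get converted, a lookup confined to blocks 0..7
sees no delalloc extent, while a lookup that crosses block 8 still does.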

In such a situation, if a Direct I/O read operation is performed on file
range [X, Y] (where X < EOF and Y > EOF), we flush dirty data in the
range [X, Y] and invalidate the page cache for that range (refer to
iomap_dio_rw()). Later, to perform the Direct I/O read, XFS obtains
the extent items (which are still cached in memory) for the file
range. When doing so we are not supposed to get an extent item with
the IOMAP_DELALLOC flag set, since the previous "flush" operation should
have converted any delayed allocation data in the range [X, Y]. Because
the 'prealloc EOF blocks' are still marked as delayed allocation, we
end up hitting a WARN_ON_ONCE(1) statement in iomap_dio_actor().

This commit fixes the bug by preventing the read operation from going
beyond iomap_dio->i_size.

Reported-by: Santhosh G <[email protected]>
Signed-off-by: Chandan Rajendra <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>
dcui pushed a commit to dcui/linux that referenced this pull request Jul 26, 2017
BugLink: http://bugs.launchpad.net/bugs/1697955

commit a008c31 upstream.

On a ppc64 machine, executing overlayfs/019 with xfs as the lower and
upper filesystem causes the following call trace,

WARNING: CPU: 2 PID: 8034 at /root/repos/linux/fs/iomap.c:765 .iomap_dio_actor+0xcc/0x420
Modules linked in:
CPU: 2 PID: 8034 Comm: fsstress Tainted: G             L  4.11.0-rc5-next-20170405 #100
task: c000000631314880 task.stack: c0000003915d4000
NIP: c00000000035a72c LR: c00000000035a6f4 CTR: c00000000035a660
REGS: c0000003915d7570 TRAP: 0700   Tainted: G             L   (4.11.0-rc5-next-20170405)
MSR: 800000000282b032 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI>
  CR: 24004284  XER: 00000000
CFAR: c0000000006f7190 SOFTE: 1
GPR00: c00000000035a6f4 c0000003915d77f0 c0000000015a3f00 000000007c22f600
GPR04: 000000000022d000 0000000000002600 c0000003b2d56360 c0000003915d7960
GPR08: c0000003915d7cd0 0000000000000002 0000000000002600 c000000000521cc0
GPR12: 0000000024004284 c00000000fd80a00 000000004b04ae64 ffffffffffffffff
GPR16: 000000001000ca70 0000000000000000 c0000003b2d56380 c00000000153d2b8
GPR20: 0000000000000010 c0000003bc87bac8 0000000000223000 000000000022f5ff
GPR24: c0000003b2d56360 000000000000000c 0000000000002600 000000000022d000
GPR28: 0000000000000000 c0000003915d7960 c0000003b2d56360 00000000000001ff
NIP [c00000000035a72c] .iomap_dio_actor+0xcc/0x420
LR [c00000000035a6f4] .iomap_dio_actor+0x94/0x420
Call Trace:
[c0000003915d77f0] [c00000000035a6f4] .iomap_dio_actor+0x94/0x420 (unreliable)
[c0000003915d78f0] [c00000000035b9f4] .iomap_apply+0xf4/0x1f0
[c0000003915d79d0] [c00000000035c320] .iomap_dio_rw+0x230/0x420
[c0000003915d7ae0] [c000000000512a14] .xfs_file_dio_aio_read+0x84/0x160
[c0000003915d7b80] [c000000000512d24] .xfs_file_read_iter+0x104/0x130
[c0000003915d7c10] [c0000000002d6234] .__vfs_read+0x114/0x1a0
[c0000003915d7cf0] [c0000000002d7a8c] .vfs_read+0xac/0x1a0
[c0000003915d7d90] [c0000000002d96b8] .SyS_read+0x58/0x100
[c0000003915d7e30] [c00000000000b8e0] system_call+0x38/0xfc
Instruction dump:
78630020 7f831b78 7ffc07b4 7c7ce039 40820360 a13d0018 2f890003 419e0288
2f890004 419e00a0 2f890001 419e02a8 <0fe00000> 3b80fffb 38210100 7f83e378

The above problem can also be recreated on a regular xfs filesystem
using the command,

$ fsstress -d /mnt -l 1000 -n 1000 -p 1000

The reason for the call trace is,
1. When 'reserving' blocks for delayed allocation, XFS reserves more
   blocks (i.e. past the file's current EOF) than required. This is done
   because XFS assumes that userspace might write more data and hence
   'reserving' more blocks might lead to the file's new data being
   stored contiguously on disk.
2. The in-memory 'struct xfs_bmbt_irec' mapping the file's last extent would
   then cover the prealloc-ed EOF blocks in addition to the regular blocks.
3. When flushing the dirty blocks to disk, we only flush data till the
   file's EOF. But before writing out the dirty data, we allocate blocks
   on the disk for holding the file's new data. This allocation includes
   the blocks that are part of the 'prealloc EOF blocks'.
4. Later, when the last reference to the inode is being closed, XFS frees the
   unused 'prealloc EOF blocks' in xfs_inactive().

In step 3 above, when allocating space on disk for the delayed allocation
range, the space allocator might sometimes allocate fewer blocks than
required. If such an allocation ends right at the current EOF of the
file, we will not be able to clear the "delayed allocation" flag for the
'prealloc EOF blocks', since we won't have dirty buffer heads associated
with that range of the file.

In such a situation, if a Direct I/O read operation is performed on file
range [X, Y] (where X < EOF and Y > EOF), we flush dirty data in the
range [X, Y] and invalidate the page cache for that range (refer to
iomap_dio_rw()). Later, to perform the Direct I/O read, XFS obtains
the extent items (which are still cached in memory) for the file
range. When doing so we are not supposed to get an extent item with
the IOMAP_DELALLOC flag set, since the previous "flush" operation should
have converted any delayed allocation data in the range [X, Y]. Because
the 'prealloc EOF blocks' are still marked as delayed allocation, we
end up hitting a WARN_ON_ONCE(1) statement in iomap_dio_actor().

This commit fixes the bug by preventing the read operation from going
beyond iomap_dio->i_size.

Reported-by: Santhosh G <[email protected]>
Signed-off-by: Chandan Rajendra <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Seth Forshee <[email protected]>
iaguis pushed a commit to kinvolk/linux that referenced this pull request Feb 6, 2018
fengguang pushed a commit to 0day-ci/linux that referenced this pull request Jul 24, 2018
ERROR: space prohibited after that '-' (ctx:WxW)
#87: FILE: fs/hfs/hfs_fs.h:265:
+	ut -= - sys_tz.tz_minuteswest * 60;
 	      ^

WARNING: line over 80 characters
#100: FILE: fs/hfs/hfs_fs.h:276:
+#define hfs_m_to_utime(time)   (struct timespec){ .tv_sec = __hfs_m_to_utime(time) }

total: 1 errors, 1 warnings, 71 lines checked

NOTE: For some of the reported defects, checkpatch may be able to
      mechanically convert to the typical style using --fix or --fix-inplace.

./patches/hfs-hfsplus-follow-macos-time-behavior.patch has style problems, please review.

NOTE: If any of the errors are false positives, please report
      them to the maintainer, see CHECKPATCH in MAINTAINERS.

Please run checkpatch prior to sending patches
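
As an illustration of the first error reported above: the flagged
statement is a compound subtraction of a negated value, so removing the
space after the unary '-' changes only the spelling, not the result. A
minimal sketch, where the function names and the plain 'minuteswest'
parameter are hypothetical stand-ins and not taken from fs/hfs/hfs_fs.h:

```c
/* Spelling flagged by checkpatch: a space follows the unary '-'. */
static long adjust_flagged(long ut, int minuteswest)
{
	ut -= - minuteswest * 60;
	return ut;
}

/* Same expression with the space removed, which checkpatch accepts. */
static long adjust_clean(long ut, int minuteswest)
{
	ut -= -minuteswest * 60;
	return ut;
}
```

Both functions subtract the negated offset, i.e. they add
minuteswest * 60 to ut.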

Cc: Arnd Bergmann <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Stephen Rothwell <[email protected]>
fengguang pushed a commit to 0day-ci/linux that referenced this pull request Sep 19, 2018
These NEON and non-NEON implementations come from Andy Polyakov's
implementation. They are exactly the same as Andy Polyakov's original,
with the following exceptions:

- Entries and exits use the proper kernel convention macro.
- CPU feature checking is done in C by the glue code, so that has been
  removed from the assembly.
- The function names have been renamed to fit kernel conventions.
- Labels have been renamed (prefixed with .L) to fit kernel conventions.
- Constants have been rearranged so that they are closer to the code
  that is using them. [ARM only]
- The neon code can jump to the scalar code when it makes sense to do
  so.
- The neon_512 function as a separate function has been removed, leaving
  the decision up to the main neon entry point. [ARM64 only]
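
The split between C glue code and assembly listed above could look
roughly like the sketch below. It is an assumption-laden illustration,
not the glue code of this patch: the stub functions only record which
path was taken, and the signatures, the 'have_neon' and 'simd_usable'
flags and the dispatcher name are all hypothetical. Only the names
chacha20_arm and chacha20_neon come from the diff.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum chacha_impl { CHACHA_SCALAR, CHACHA_NEON };

/* Records the path taken by the last call; for illustration only. */
static enum chacha_impl last_impl;

/* Stand-in for the scalar assembly entry point (chacha20_arm). */
static void chacha20_arm_stub(uint8_t *dst, const uint8_t *src, size_t len)
{
	(void)dst; (void)src; (void)len;
	last_impl = CHACHA_SCALAR;
}

/* Stand-in for the NEON assembly entry point (chacha20_neon). */
static void chacha20_neon_stub(uint8_t *dst, const uint8_t *src, size_t len)
{
	(void)dst; (void)src; (void)len;
	last_impl = CHACHA_NEON;
}

/*
 * Hypothetical glue: the CPU feature check happens here in C, so the
 * assembly no longer needs to probe for NEON itself.
 */
static void chacha20_dispatch(bool have_neon, bool simd_usable,
			      uint8_t *dst, const uint8_t *src, size_t len)
{
	if (have_neon && simd_usable)
		chacha20_neon_stub(dst, src, len);
	else
		chacha20_arm_stub(dst, src, len);
}
```

The length test stays in the assembly: as the diff shows, the NEON entry
point itself branches to .Lshort for short inputs.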

After '/^#/d;/^\..*[^:]$/d', the code has the following diff in actual
instructions from the original.

ARM:

-ChaCha20_ctr32:
-.LChaCha20_ctr32:
+ENTRY(chacha20_arm)
 	ldr	r12,[sp,#0]		@ pull pointer to counter and nonce
 	stmdb	sp!,{r0-r2,r4-r11,lr}
-	sub	r14,pc,#16		@ ChaCha20_ctr32
-	adr	r14,.LChaCha20_ctr32
 	cmp	r2,#0			@ len==0?
 	itt	eq
 	addeq	sp,sp,#4*3
-	beq	.Lno_data
-	cmp	r2,#192			@ test len
-	bls	.Lshort
-	ldr	r4,[r14,#-32]
-	ldr	r4,[r14,r4]
-	ldr	r4,[r4]
-	tst	r4,#ARMV7_NEON
-	bne	.LChaCha20_neon
+	beq	.Lno_data_arm
 .Lshort:
 	ldmia	r12,{r4-r7}		@ load counter and nonce
 	sub	sp,sp,#4*(16)		@ off-load area
-	sub	r14,r14,#64		@ .Lsigma
+	sub	r14,pc,#100		@ .Lsigma
+	adr	r14,.Lsigma		@ .Lsigma
 	stmdb	sp!,{r4-r7}		@ copy counter and nonce
 	ldmia	r3,{r4-r11}		@ load key
 	ldmia	r14,{r0-r3}		@ load sigma
@@ -617,14 +615,25 @@

 .Ldone:
 	add	sp,sp,#4*(32+3)
-.Lno_data:
+.Lno_data_arm:
 	ldmia	sp!,{r4-r11,pc}
+ENDPROC(chacha20_arm)

-ChaCha20_neon:
+ENTRY(chacha20_neon)
 	ldr		r12,[sp,#0]		@ pull pointer to counter and nonce
 	stmdb		sp!,{r0-r2,r4-r11,lr}
-.LChaCha20_neon:
-	adr		r14,.Lsigma
+	cmp		r2,#0			@ len==0?
+	itt		eq
+	addeq		sp,sp,#4*3
+	beq		.Lno_data_neon
+	cmp		r2,#192			@ test len
+	bls		.Lshort
+.Lchacha20_neon_begin:
+	adr		r14,.Lsigma2
 	vstmdb		sp!,{d8-d15}		@ ABI spec says so
 	stmdb		sp!,{r0-r3}

@@ -1265,4 +1274,6 @@
 	add		sp,sp,#4*(32+4)
 	vldmia		sp,{d8-d15}
 	add		sp,sp,#4*(16+3)
+.Lno_data_neon:
 	ldmia		sp!,{r4-r11,pc}
+ENDPROC(chacha20_neon)

ARM64:

-ChaCha20_ctr32:
+ENTRY(chacha20_arm)
 	cbz	x2,.Labort
-	adr	x5,.LOPENSSL_armcap_P
-	cmp	x2,#192
-	b.lo	.Lshort
-	ldrsw	x6,[x5]
-	ldr	x6,[x5]
-	ldr	w17,[x6,x5]
-	tst	w17,#ARMV7_NEON
-	b.ne	ChaCha20_neon
-
 .Lshort:
 	stp	x29,x30,[sp,#-96]!
 	add	x29,sp,#0
@@ -279,8 +274,13 @@
 	ldp	x27,x28,[x29,#80]
 	ldp	x29,x30,[sp],#96
 	ret
+ENDPROC(chacha20_arm)
+
+ENTRY(chacha20_neon)
+	cbz	x2,.Labort_neon
+	cmp	x2,#192
+	b.lo	.Lshort

-ChaCha20_neon:
 	stp	x29,x30,[sp,#-96]!
 	add	x29,sp,#0

@@ -763,16 +763,6 @@
 	ldp	x27,x28,[x29,#80]
 	ldp	x29,x30,[sp],#96
 	ret
-ChaCha20_512_neon:
-	stp	x29,x30,[sp,#-96]!
-	add	x29,sp,#0
-
-	adr	x5,.Lsigma
-	stp	x19,x20,[sp,#16]
-	stp	x21,x22,[sp,#32]
-	stp	x23,x24,[sp,#48]
-	stp	x25,x26,[sp,#64]
-	stp	x27,x28,[sp,#80]

 .L512_or_more_neon:
 	sub	sp,sp,#128+64
@@ -1920,4 +1910,6 @@
 	ldp	x25,x26,[x29,#64]
 	ldp	x27,x28,[x29,#80]
 	ldp	x29,x30,[sp],#96
+.Labort_neon:
 	ret
+ENDPROC(chacha20_neon)

Signed-off-by: Jason A. Donenfeld <[email protected]>
Cc: Samuel Neves <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Jean-Philippe Aumasson <[email protected]>
Cc: Andy Polyakov <[email protected]>
Cc: Russell King <[email protected]>
Cc: [email protected]
fengguang pushed a commit to 0day-ci/linux that referenced this pull request Feb 11, 2019
If we drop the engine lock, we may run execlists_dequeue which may free
the priolist. Therefore if we ever drop the execution lock on the
engine, we have to discard our cache and refetch the priolist to ensure
we do not use a stale pointer.
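
The rule stated above (discard the cached pointer whenever the lock is
dropped or switched, then refetch) can be sketched with a single-threaded
toy model. This is not the i915 code: every toy_* name is hypothetical,
and the "lock" is only implied by which engine the cache belongs to.

```c
#include <stddef.h>

struct toy_priolist { int priority; };

/* The queue pointer may be freed or replaced while the lock is not held. */
struct toy_engine { struct toy_priolist *queue; };

/* Local cache: valid only for 'engine' and only while its lock is held. */
struct toy_cache {
	const struct toy_engine *engine;
	struct toy_priolist *priolist;
};

/* Call whenever the engine lock is dropped: the cache may now be stale. */
static void toy_drop_lock(struct toy_cache *cache)
{
	cache->engine = NULL;
	cache->priolist = NULL;
}

/* Look up the priolist, refetching if the cache is empty or foreign. */
static struct toy_priolist *toy_lookup(struct toy_engine *engine,
				       struct toy_cache *cache)
{
	if (cache->engine != engine) {
		/* Switched engine locks: never reuse the old pointer. */
		cache->engine = engine;
		cache->priolist = NULL;
	}
	if (!cache->priolist)
		cache->priolist = engine->queue;	/* refetch under the lock */
	return cache->priolist;
}
```

Without the call to toy_drop_lock(), a second lookup on the same engine
would keep returning the old pointer even after the queue was replaced.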

[  506.418935] [IGT] gem_exec_whisper: starting subtest contexts-priority
[  593.240825] general protection fault: 0000 [#1] SMP
[  593.240863] CPU: 1 PID: 494 Comm: gem_exec_whispe Tainted: G     U            5.0.0-rc6+ #100
[  593.240879] Hardware name:  /NUC6CAYB, BIOS AYAPLCEL.86A.0029.2016.1124.1625 11/24/2016
[  593.240965] RIP: 0010:__i915_schedule+0x1fe/0x320 [i915]
[  593.240981] Code: 48 8b 0c 24 48 89 c3 49 8b 45 28 49 8b 75 20 4c 89 3c 24 48 89 46 08 48 89 30 48 8b 43 08 48 89 4b 08 49 89 5d 20 49 89 45 28 <48> 89 08 45 39 a7 b8 03 00 00 7d 44 45 89 a7 b8 03 00 00 49 8b 85
[  593.240999] RSP: 0018:ffffc90000057a60 EFLAGS: 00010046
[  593.241013] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8882582d7870 RCX: ffff88826baba6f0
[  593.241026] RDX: 0000000000000000 RSI: ffff8882582d6e70 RDI: ffff888273482194
[  593.241049] RBP: ffffc90000057a68 R08: ffff8882582d7680 R09: ffff8882582d7840
[  593.241068] R10: 0000000000000000 R11: ffffea00095ebe08 R12: 0000000000000728
[  593.241105] R13: ffff88826baba6d0 R14: ffffc90000057a40 R15: ffff888273482158
[  593.241120] FS:  00007f4613fb3900(0000) GS:ffff888277a80000(0000) knlGS:0000000000000000
[  593.241133] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  593.241146] CR2: 00007f57d3c66a84 CR3: 000000026e2b6000 CR4: 00000000001406e0
[  593.241158] Call Trace:
[  593.241233]  i915_schedule+0x1f/0x30 [i915]
[  593.241326]  i915_request_add+0x1a9/0x290 [i915]
[  593.241393]  i915_gem_do_execbuffer+0x45f/0x1150 [i915]
[  593.241411]  ? init_object+0x49/0x80
[  593.241425]  ? ___slab_alloc.constprop.91+0x4b8/0x4e0
[  593.241491]  ? i915_gem_execbuffer2_ioctl+0x99/0x380 [i915]
[  593.241563]  ? i915_gem_execbuffer_ioctl+0x270/0x270 [i915]
[  593.241629]  i915_gem_execbuffer2_ioctl+0x1bb/0x380 [i915]
[  593.241705]  ? i915_gem_execbuffer_ioctl+0x270/0x270 [i915]
[  593.241724]  drm_ioctl_kernel+0x81/0xd0
[  593.241738]  drm_ioctl+0x1a7/0x310
[  593.241803]  ? i915_gem_execbuffer_ioctl+0x270/0x270 [i915]
[  593.241819]  ? __update_load_avg_se+0x1c9/0x240
[  593.241834]  ? pick_next_entity+0x7e/0x120
[  593.241851]  do_vfs_ioctl+0x88/0x5d0
[  593.241880]  ksys_ioctl+0x35/0x70
[  593.241894]  __x64_sys_ioctl+0x11/0x20
[  593.241907]  do_syscall_64+0x44/0xf0
[  593.241924]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  593.241940] RIP: 0033:0x7f4615ffe757
[  593.241952] Code: 00 00 90 48 8b 05 39 a7 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 09 a7 0c 00 f7 d8 64 89 01 48
[  593.241970] RSP: 002b:00007ffc1030ddf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  593.241984] RAX: ffffffffffffffda RBX: 00007ffc10324420 RCX: 00007f4615ffe757
[  593.241997] RDX: 00007ffc1030e220 RSI: 0000000040406469 RDI: 0000000000000003
[  593.242010] RBP: 00007ffc1030e220 R08: 00007f46160c9208 R09: 00007f46160c9240
[  593.242022] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000040406469
[  593.242038] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
[  593.242058] Modules linked in: i915 intel_gtt drm_kms_helper prime_numbers

v2: Track the local engine cache and explicitly clear it when switching
engine locks.

Fixes: a02eb97 ("drm/i915/execlists: Cache the priolist when rescheduling")
Testcase: igt/gem_exec_whisper/contexts-priority # rare!
Signed-off-by: Chris Wilson <[email protected]>
Cc: Joonas Lahtinen <[email protected]>
Cc: Tvrtko Ursulin <[email protected]>
Cc: Michał Winiarski <[email protected]>
fengguang pushed a commit to 0day-ci/linux that referenced this pull request Feb 11, 2019
If we drop the engine lock, we may run execlists_dequeue which may free
the priolist. Therefore if we ever drop the execution lock on the
engine, we have to discard our cache and refetch the priolist to ensure
we do not use a stale pointer.

[  506.418935] [IGT] gem_exec_whisper: starting subtest contexts-priority
[  593.240825] general protection fault: 0000 [#1] SMP
[  593.240863] CPU: 1 PID: 494 Comm: gem_exec_whispe Tainted: G     U            5.0.0-rc6+ #100
[  593.240879] Hardware name:  /NUC6CAYB, BIOS AYAPLCEL.86A.0029.2016.1124.1625 11/24/2016
[  593.240965] RIP: 0010:__i915_schedule+0x1fe/0x320 [i915]
[  593.240981] Code: 48 8b 0c 24 48 89 c3 49 8b 45 28 49 8b 75 20 4c 89 3c 24 48 89 46 08 48 89 30 48 8b 43 08 48 89 4b 08 49 89 5d 20 49 89 45 28 <48> 89 08 45 39 a7 b8 03 00 00 7d 44 45 89 a7 b8 03 00 00 49 8b 85
[  593.240999] RSP: 0018:ffffc90000057a60 EFLAGS: 00010046
[  593.241013] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8882582d7870 RCX: ffff88826baba6f0
[  593.241026] RDX: 0000000000000000 RSI: ffff8882582d6e70 RDI: ffff888273482194
[  593.241049] RBP: ffffc90000057a68 R08: ffff8882582d7680 R09: ffff8882582d7840
[  593.241068] R10: 0000000000000000 R11: ffffea00095ebe08 R12: 0000000000000728
[  593.241105] R13: ffff88826baba6d0 R14: ffffc90000057a40 R15: ffff888273482158
[  593.241120] FS:  00007f4613fb3900(0000) GS:ffff888277a80000(0000) knlGS:0000000000000000
[  593.241133] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  593.241146] CR2: 00007f57d3c66a84 CR3: 000000026e2b6000 CR4: 00000000001406e0
[  593.241158] Call Trace:
[  593.241233]  i915_schedule+0x1f/0x30 [i915]
[  593.241326]  i915_request_add+0x1a9/0x290 [i915]
[  593.241393]  i915_gem_do_execbuffer+0x45f/0x1150 [i915]
[  593.241411]  ? init_object+0x49/0x80
[  593.241425]  ? ___slab_alloc.constprop.91+0x4b8/0x4e0
[  593.241491]  ? i915_gem_execbuffer2_ioctl+0x99/0x380 [i915]
[  593.241563]  ? i915_gem_execbuffer_ioctl+0x270/0x270 [i915]
[  593.241629]  i915_gem_execbuffer2_ioctl+0x1bb/0x380 [i915]
[  593.241705]  ? i915_gem_execbuffer_ioctl+0x270/0x270 [i915]
[  593.241724]  drm_ioctl_kernel+0x81/0xd0
[  593.241738]  drm_ioctl+0x1a7/0x310
[  593.241803]  ? i915_gem_execbuffer_ioctl+0x270/0x270 [i915]
[  593.241819]  ? __update_load_avg_se+0x1c9/0x240
[  593.241834]  ? pick_next_entity+0x7e/0x120
[  593.241851]  do_vfs_ioctl+0x88/0x5d0
[  593.241880]  ksys_ioctl+0x35/0x70
[  593.241894]  __x64_sys_ioctl+0x11/0x20
[  593.241907]  do_syscall_64+0x44/0xf0
[  593.241924]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  593.241940] RIP: 0033:0x7f4615ffe757
[  593.241952] Code: 00 00 90 48 8b 05 39 a7 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 09 a7 0c 00 f7 d8 64 89 01 48
[  593.241970] RSP: 002b:00007ffc1030ddf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  593.241984] RAX: ffffffffffffffda RBX: 00007ffc10324420 RCX: 00007f4615ffe757
[  593.241997] RDX: 00007ffc1030e220 RSI: 0000000040406469 RDI: 0000000000000003
[  593.242010] RBP: 00007ffc1030e220 R08: 00007f46160c9208 R09: 00007f46160c9240
[  593.242022] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000040406469
[  593.242038] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
[  593.242058] Modules linked in: i915 intel_gtt drm_kms_helper prime_numbers

v2: Track the local engine cache and explicitly clear it when switching
engine locks.

Fixes: a02eb97 ("drm/i915/execlists: Cache the priolist when rescheduling")
Testcase: igt/gem_exec_whisper/contexts-priority # rare!
Signed-off-by: Chris Wilson <[email protected]>
Cc: Joonas Lahtinen <[email protected]>
Cc: Tvrtko Ursulin <[email protected]>
Cc: Michał Winiarski <[email protected]>
Reviewed-by: Tvrtko Ursulin <[email protected]>
fengguang pushed a commit to 0day-ci/linux that referenced this pull request Feb 11, 2019
If we drop the engine lock, we may run execlists_dequeue which may free
the priolist. Therefore if we ever drop the execution lock on the
engine, we have to discard our cache and refetch the priolist to ensure
we do not use a stale pointer.

[  506.418935] [IGT] gem_exec_whisper: starting subtest contexts-priority
[  593.240825] general protection fault: 0000 [#1] SMP
[  593.240863] CPU: 1 PID: 494 Comm: gem_exec_whispe Tainted: G     U            5.0.0-rc6+ #100
[  593.240879] Hardware name:  /NUC6CAYB, BIOS AYAPLCEL.86A.0029.2016.1124.1625 11/24/2016
[  593.240965] RIP: 0010:__i915_schedule+0x1fe/0x320 [i915]
[  593.240981] Code: 48 8b 0c 24 48 89 c3 49 8b 45 28 49 8b 75 20 4c 89 3c 24 48 89 46 08 48 89 30 48 8b 43 08 48 89 4b 08 49 89 5d 20 49 89 45 28 <48> 89 08 45 39 a7 b8 03 00 00 7d 44 45 89 a7 b8 03 00 00 49 8b 85
[  593.240999] RSP: 0018:ffffc90000057a60 EFLAGS: 00010046
[  593.241013] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8882582d7870 RCX: ffff88826baba6f0
[  593.241026] RDX: 0000000000000000 RSI: ffff8882582d6e70 RDI: ffff888273482194
[  593.241049] RBP: ffffc90000057a68 R08: ffff8882582d7680 R09: ffff8882582d7840
[  593.241068] R10: 0000000000000000 R11: ffffea00095ebe08 R12: 0000000000000728
[  593.241105] R13: ffff88826baba6d0 R14: ffffc90000057a40 R15: ffff888273482158
[  593.241120] FS:  00007f4613fb3900(0000) GS:ffff888277a80000(0000) knlGS:0000000000000000
[  593.241133] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  593.241146] CR2: 00007f57d3c66a84 CR3: 000000026e2b6000 CR4: 00000000001406e0
[  593.241158] Call Trace:
[  593.241233]  i915_schedule+0x1f/0x30 [i915]
[  593.241326]  i915_request_add+0x1a9/0x290 [i915]
[  593.241393]  i915_gem_do_execbuffer+0x45f/0x1150 [i915]
[  593.241411]  ? init_object+0x49/0x80
[  593.241425]  ? ___slab_alloc.constprop.91+0x4b8/0x4e0
[  593.241491]  ? i915_gem_execbuffer2_ioctl+0x99/0x380 [i915]
[  593.241563]  ? i915_gem_execbuffer_ioctl+0x270/0x270 [i915]
[  593.241629]  i915_gem_execbuffer2_ioctl+0x1bb/0x380 [i915]
[  593.241705]  ? i915_gem_execbuffer_ioctl+0x270/0x270 [i915]
[  593.241724]  drm_ioctl_kernel+0x81/0xd0
[  593.241738]  drm_ioctl+0x1a7/0x310
[  593.241803]  ? i915_gem_execbuffer_ioctl+0x270/0x270 [i915]
[  593.241819]  ? __update_load_avg_se+0x1c9/0x240
[  593.241834]  ? pick_next_entity+0x7e/0x120
[  593.241851]  do_vfs_ioctl+0x88/0x5d0
[  593.241880]  ksys_ioctl+0x35/0x70
[  593.241894]  __x64_sys_ioctl+0x11/0x20
[  593.241907]  do_syscall_64+0x44/0xf0
[  593.241924]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  593.241940] RIP: 0033:0x7f4615ffe757
[  593.241952] Code: 00 00 90 48 8b 05 39 a7 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 09 a7 0c 00 f7 d8 64 89 01 48
[  593.241970] RSP: 002b:00007ffc1030ddf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  593.241984] RAX: ffffffffffffffda RBX: 00007ffc10324420 RCX: 00007f4615ffe757
[  593.241997] RDX: 00007ffc1030e220 RSI: 0000000040406469 RDI: 0000000000000003
[  593.242010] RBP: 00007ffc1030e220 R08: 00007f46160c9208 R09: 00007f46160c9240
[  593.242022] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000040406469
[  593.242038] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
[  593.242058] Modules linked in: i915 intel_gtt drm_kms_helper prime_numbers

Fixes: a02eb97 ("drm/i915/execlists: Cache the priolist when rescheduling")
Testcase: igt/gem_exec_whisper/contexts-priority # rare!
Signed-off-by: Chris Wilson <[email protected]>
Cc: Joonas Lahtinen <[email protected]>
Cc: Tvrtko Ursulin <[email protected]>
Cc: Michał Winiarski <[email protected]>
djdeath pushed a commit to djdeath/linux that referenced this pull request Feb 18, 2019
If we drop the engine lock, we may run execlists_dequeue which may free
the priolist. Therefore if we ever drop the execution lock on the
engine, we have to discard our cache and refetch the priolist to ensure
we do not use a stale pointer.

[  506.418935] [IGT] gem_exec_whisper: starting subtest contexts-priority
[  593.240825] general protection fault: 0000 [#1] SMP
[  593.240863] CPU: 1 PID: 494 Comm: gem_exec_whispe Tainted: G     U            5.0.0-rc6+ #100
[  593.240879] Hardware name:  /NUC6CAYB, BIOS AYAPLCEL.86A.0029.2016.1124.1625 11/24/2016
[  593.240965] RIP: 0010:__i915_schedule+0x1fe/0x320 [i915]
[  593.240981] Code: 48 8b 0c 24 48 89 c3 49 8b 45 28 49 8b 75 20 4c 89 3c 24 48 89 46 08 48 89 30 48 8b 43 08 48 89 4b 08 49 89 5d 20 49 89 45 28 <48> 89 08 45 39 a7 b8 03 00 00 7d 44 45 89 a7 b8 03 00 00 49 8b 85
[  593.240999] RSP: 0018:ffffc90000057a60 EFLAGS: 00010046
[  593.241013] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8882582d7870 RCX: ffff88826baba6f0
[  593.241026] RDX: 0000000000000000 RSI: ffff8882582d6e70 RDI: ffff888273482194
[  593.241049] RBP: ffffc90000057a68 R08: ffff8882582d7680 R09: ffff8882582d7840
[  593.241068] R10: 0000000000000000 R11: ffffea00095ebe08 R12: 0000000000000728
[  593.241105] R13: ffff88826baba6d0 R14: ffffc90000057a40 R15: ffff888273482158
[  593.241120] FS:  00007f4613fb3900(0000) GS:ffff888277a80000(0000) knlGS:0000000000000000
[  593.241133] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  593.241146] CR2: 00007f57d3c66a84 CR3: 000000026e2b6000 CR4: 00000000001406e0
[  593.241158] Call Trace:
[  593.241233]  i915_schedule+0x1f/0x30 [i915]
[  593.241326]  i915_request_add+0x1a9/0x290 [i915]
[  593.241393]  i915_gem_do_execbuffer+0x45f/0x1150 [i915]
[  593.241411]  ? init_object+0x49/0x80
[  593.241425]  ? ___slab_alloc.constprop.91+0x4b8/0x4e0
[  593.241491]  ? i915_gem_execbuffer2_ioctl+0x99/0x380 [i915]
[  593.241563]  ? i915_gem_execbuffer_ioctl+0x270/0x270 [i915]
[  593.241629]  i915_gem_execbuffer2_ioctl+0x1bb/0x380 [i915]
[  593.241705]  ? i915_gem_execbuffer_ioctl+0x270/0x270 [i915]
[  593.241724]  drm_ioctl_kernel+0x81/0xd0
[  593.241738]  drm_ioctl+0x1a7/0x310
[  593.241803]  ? i915_gem_execbuffer_ioctl+0x270/0x270 [i915]
[  593.241819]  ? __update_load_avg_se+0x1c9/0x240
[  593.241834]  ? pick_next_entity+0x7e/0x120
[  593.241851]  do_vfs_ioctl+0x88/0x5d0
[  593.241880]  ksys_ioctl+0x35/0x70
[  593.241894]  __x64_sys_ioctl+0x11/0x20
[  593.241907]  do_syscall_64+0x44/0xf0
[  593.241924]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  593.241940] RIP: 0033:0x7f4615ffe757
[  593.241952] Code: 00 00 90 48 8b 05 39 a7 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 09 a7 0c 00 f7 d8 64 89 01 48
[  593.241970] RSP: 002b:00007ffc1030ddf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  593.241984] RAX: ffffffffffffffda RBX: 00007ffc10324420 RCX: 00007f4615ffe757
[  593.241997] RDX: 00007ffc1030e220 RSI: 0000000040406469 RDI: 0000000000000003
[  593.242010] RBP: 00007ffc1030e220 R08: 00007f46160c9208 R09: 00007f46160c9240
[  593.242022] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000040406469
[  593.242038] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
[  593.242058] Modules linked in: i915 intel_gtt drm_kms_helper prime_numbers

v2: Track the local engine cache and explicitly clear it when switching
engine locks.

Fixes: a02eb97 ("drm/i915/execlists: Cache the priolist when rescheduling")
Testcase: igt/gem_exec_whisper/contexts-priority # rare!
Signed-off-by: Chris Wilson <[email protected]>
Cc: Joonas Lahtinen <[email protected]>
Cc: Tvrtko Ursulin <[email protected]>
Cc: Michał Winiarski <[email protected]>
Reviewed-by: Tvrtko Ursulin <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
fengguang pushed a commit to 0day-ci/linux that referenced this pull request Feb 20, 2019
If we drop the engine lock, we may run execlists_dequeue which may free
the priolist. Therefore if we ever drop the execution lock on the
engine, we have to discard our cache and refetch the priolist to ensure
we do not use a stale pointer.

[  506.418935] [IGT] gem_exec_whisper: starting subtest contexts-priority
[  593.240825] general protection fault: 0000 [#1] SMP
[  593.240863] CPU: 1 PID: 494 Comm: gem_exec_whispe Tainted: G     U            5.0.0-rc6+ #100
[  593.240879] Hardware name:  /NUC6CAYB, BIOS AYAPLCEL.86A.0029.2016.1124.1625 11/24/2016
[  593.240965] RIP: 0010:__i915_schedule+0x1fe/0x320 [i915]
[  593.240981] Code: 48 8b 0c 24 48 89 c3 49 8b 45 28 49 8b 75 20 4c 89 3c 24 48 89 46 08 48 89 30 48 8b 43 08 48 89 4b 08 49 89 5d 20 49 89 45 28 <48> 89 08 45 39 a7 b8 03 00 00 7d 44 45 89 a7 b8 03 00 00 49 8b 85
[  593.240999] RSP: 0018:ffffc90000057a60 EFLAGS: 00010046
[  593.241013] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8882582d7870 RCX: ffff88826baba6f0
[  593.241026] RDX: 0000000000000000 RSI: ffff8882582d6e70 RDI: ffff888273482194
[  593.241049] RBP: ffffc90000057a68 R08: ffff8882582d7680 R09: ffff8882582d7840
[  593.241068] R10: 0000000000000000 R11: ffffea00095ebe08 R12: 0000000000000728
[  593.241105] R13: ffff88826baba6d0 R14: ffffc90000057a40 R15: ffff888273482158
[  593.241120] FS:  00007f4613fb3900(0000) GS:ffff888277a80000(0000) knlGS:0000000000000000
[  593.241133] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  593.241146] CR2: 00007f57d3c66a84 CR3: 000000026e2b6000 CR4: 00000000001406e0
[  593.241158] Call Trace:
[  593.241233]  i915_schedule+0x1f/0x30 [i915]
[  593.241326]  i915_request_add+0x1a9/0x290 [i915]
[  593.241393]  i915_gem_do_execbuffer+0x45f/0x1150 [i915]
[  593.241411]  ? init_object+0x49/0x80
[  593.241425]  ? ___slab_alloc.constprop.91+0x4b8/0x4e0
[  593.241491]  ? i915_gem_execbuffer2_ioctl+0x99/0x380 [i915]
[  593.241563]  ? i915_gem_execbuffer_ioctl+0x270/0x270 [i915]
[  593.241629]  i915_gem_execbuffer2_ioctl+0x1bb/0x380 [i915]
[  593.241705]  ? i915_gem_execbuffer_ioctl+0x270/0x270 [i915]
[  593.241724]  drm_ioctl_kernel+0x81/0xd0
[  593.241738]  drm_ioctl+0x1a7/0x310
[  593.241803]  ? i915_gem_execbuffer_ioctl+0x270/0x270 [i915]
[  593.241819]  ? __update_load_avg_se+0x1c9/0x240
[  593.241834]  ? pick_next_entity+0x7e/0x120
[  593.241851]  do_vfs_ioctl+0x88/0x5d0
[  593.241880]  ksys_ioctl+0x35/0x70
[  593.241894]  __x64_sys_ioctl+0x11/0x20
[  593.241907]  do_syscall_64+0x44/0xf0
[  593.241924]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  593.241940] RIP: 0033:0x7f4615ffe757
[  593.241952] Code: 00 00 90 48 8b 05 39 a7 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 09 a7 0c 00 f7 d8 64 89 01 48
[  593.241970] RSP: 002b:00007ffc1030ddf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  593.241984] RAX: ffffffffffffffda RBX: 00007ffc10324420 RCX: 00007f4615ffe757
[  593.241997] RDX: 00007ffc1030e220 RSI: 0000000040406469 RDI: 0000000000000003
[  593.242010] RBP: 00007ffc1030e220 R08: 00007f46160c9208 R09: 00007f46160c9240
[  593.242022] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000040406469
[  593.242038] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
[  593.242058] Modules linked in: i915 intel_gtt drm_kms_helper prime_numbers

v2: Track the local engine cache and explicitly clear it when switching
engine locks.

Fixes: a02eb97 ("drm/i915/execlists: Cache the priolist when rescheduling")
Testcase: igt/gem_exec_whisper/contexts-priority # rare!
Signed-off-by: Chris Wilson <[email protected]>
Cc: Joonas Lahtinen <[email protected]>
Cc: Tvrtko Ursulin <[email protected]>
Cc: Michał Winiarski <[email protected]>
Reviewed-by: Tvrtko Ursulin <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
(cherry picked from commit ed7dc67)
Signed-off-by: Rodrigo Vivi <[email protected]>
fengguang pushed a commit to 0day-ci/linux that referenced this pull request Oct 15, 2021
With PREEMPT_COUNT=y, when a CPU is offlined and then onlined again, we
get:

BUG: scheduling while atomic: swapper/1/0/0x00000000
no locks held by swapper/1/0.
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.15.0-rc2+ #100
Call Trace:
 dump_stack_lvl+0xac/0x108
 __schedule_bug+0xac/0xe0
 __schedule+0xcf8/0x10d0
 schedule_idle+0x3c/0x70
 do_idle+0x2d8/0x4a0
 cpu_startup_entry+0x38/0x40
 start_secondary+0x2ec/0x3a0
 start_secondary_prolog+0x10/0x14

This is because powerpc's arch_cpu_idle_dead() decrements the idle task's
preempt count, for reasons explained in commit a7c2bb8 ("powerpc:
Re-enable preemption before cpu_die()"), specifically "start_secondary()
expects a preempt_count() of 0."

However, since commit 2c669ef ("powerpc/preempt: Don't touch the idle
task's preempt_count during hotplug") and commit f1a0a37 ("sched/core:
Initialize the idle task with preemption disabled"), that justification no
longer holds.

The idle task isn't supposed to re-enable preemption, so remove the
vestigial preempt_enable() from the CPU offline path.

Tested with pseries and powernv in qemu, and pseries on PowerVM.

Fixes: 2c669ef ("powerpc/preempt: Don't touch the idle task's preempt_count during hotplug")
Fixes: f1a0a37 ("sched/core: Initialize the idle task with preemption disabled")
Signed-off-by: Nathan Lynch <[email protected]>
fengguang pushed a commit to 0day-ci/linux that referenced this pull request Oct 15, 2021
With PREEMPT_COUNT=y, when a CPU is offlined and then onlined again, we
get:

BUG: scheduling while atomic: swapper/1/0/0x00000000
no locks held by swapper/1/0.
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.15.0-rc2+ #100
Call Trace:
 dump_stack_lvl+0xac/0x108
 __schedule_bug+0xac/0xe0
 __schedule+0xcf8/0x10d0
 schedule_idle+0x3c/0x70
 do_idle+0x2d8/0x4a0
 cpu_startup_entry+0x38/0x40
 start_secondary+0x2ec/0x3a0
 start_secondary_prolog+0x10/0x14

This is because powerpc's arch_cpu_idle_dead() decrements the idle task's
preempt count, for reasons explained in commit a7c2bb8 ("powerpc:
Re-enable preemption before cpu_die()"), specifically "start_secondary()
expects a preempt_count() of 0."

However, since commit 2c669ef ("powerpc/preempt: Don't touch the idle
task's preempt_count during hotplug") and commit f1a0a37 ("sched/core:
Initialize the idle task with preemption disabled"), that justification no
longer holds.

The idle task isn't supposed to re-enable preemption, so remove the
vestigial preempt_enable() from the CPU offline path.

Tested with pseries and powernv in qemu, and pseries on PowerVM.

Fixes: 2c669ef ("powerpc/preempt: Don't touch the idle task's preempt_count during hotplug")
Signed-off-by: Nathan Lynch <[email protected]>
Reviewed-by: Valentin Schneider <[email protected]>
mpe pushed a commit to linuxppc/linux that referenced this pull request Oct 16, 2021
With PREEMPT_COUNT=y, when a CPU is offlined and then onlined again, we
get:

BUG: scheduling while atomic: swapper/1/0/0x00000000
no locks held by swapper/1/0.
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.15.0-rc2+ #100
Call Trace:
 dump_stack_lvl+0xac/0x108
 __schedule_bug+0xac/0xe0
 __schedule+0xcf8/0x10d0
 schedule_idle+0x3c/0x70
 do_idle+0x2d8/0x4a0
 cpu_startup_entry+0x38/0x40
 start_secondary+0x2ec/0x3a0
 start_secondary_prolog+0x10/0x14

This is because powerpc's arch_cpu_idle_dead() decrements the idle task's
preempt count, for reasons explained in commit a7c2bb8 ("powerpc:
Re-enable preemption before cpu_die()"), specifically "start_secondary()
expects a preempt_count() of 0."

However, since commit 2c669ef ("powerpc/preempt: Don't touch the idle
task's preempt_count during hotplug") and commit f1a0a37 ("sched/core:
Initialize the idle task with preemption disabled"), that justification no
longer holds.

The idle task isn't supposed to re-enable preemption, so remove the
vestigial preempt_enable() from the CPU offline path.

Tested with pseries and powernv in qemu, and pseries on PowerVM.

Fixes: 2c669ef ("powerpc/preempt: Don't touch the idle task's preempt_count during hotplug")
Signed-off-by: Nathan Lynch <[email protected]>
Reviewed-by: Valentin Schneider <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
ruscur pushed a commit to ruscur/linux that referenced this pull request Oct 20, 2021
With PREEMPT_COUNT=y, when a CPU is offlined and then onlined again, we
get:

BUG: scheduling while atomic: swapper/1/0/0x00000000
no locks held by swapper/1/0.
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.15.0-rc2+ #100
Call Trace:
 dump_stack_lvl+0xac/0x108
 __schedule_bug+0xac/0xe0
 __schedule+0xcf8/0x10d0
 schedule_idle+0x3c/0x70
 do_idle+0x2d8/0x4a0
 cpu_startup_entry+0x38/0x40
 start_secondary+0x2ec/0x3a0
 start_secondary_prolog+0x10/0x14

This is because powerpc's arch_cpu_idle_dead() decrements the idle task's
preempt count, for reasons explained in commit a7c2bb8 ("powerpc:
Re-enable preemption before cpu_die()"), specifically "start_secondary()
expects a preempt_count() of 0."

However, since commit 2c669ef ("powerpc/preempt: Don't touch the idle
task's preempt_count during hotplug") and commit f1a0a37 ("sched/core:
Initialize the idle task with preemption disabled"), that justification no
longer holds.

The idle task isn't supposed to re-enable preemption, so remove the
vestigial preempt_enable() from the CPU offline path.

Tested with pseries and powernv in qemu, and pseries on PowerVM.

Fixes: 2c669ef ("powerpc/preempt: Don't touch the idle task's preempt_count during hotplug")
Signed-off-by: Nathan Lynch <[email protected]>
Reviewed-by: Valentin Schneider <[email protected]>
Reviewed-by: Srikar Dronamraju <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
fengguang pushed a commit to 0day-ci/linux that referenced this pull request Oct 21, 2021
A race condition on page table updates can happen in kernel_init, as
both the memory hotplug module init and the following mark_rodata_ro can
update the page table. The function execution flow is:

-------------------------
kernel_init
  kernel_init_freeable
    ...
      do_initcall
        ...
          module_init [A]

  ...
  mark_readonly
    mark_rodata_ro [B]
-------------------------
[A] can contain the memory hotplug init, so both [A] and [B] can
update the page table at the same time, which may lead to a race. Here we
introduce the memory hotplug lock to guard mark_rodata_ro and avoid the
race condition.

I caught this error when testing virtio-mem (a new memory hotplug
driver) on arm64; it may be a potential bug on other arches as well.

How to reproduce on arm64:
(1) prepare a kernel with virtio-mem enabled on arm64
(2) start a VM using Cloud Hypervisor[1] using the kernel above
(3) hotplug memory, 20G in my case, with virtio-mem
(4) reboot or load new kernel using kexec

Test several times and you may find the error below:

[    1.131039] Unable to handle kernel paging request at virtual address fffffbfffda3b140
[    1.134504] Mem abort info:
[    1.135722]   ESR = 0x96000007
[    1.136991]   EC = 0x25: DABT (current EL), IL = 32 bits
[    1.139189]   SET = 0, FnV = 0
[    1.140467]   EA = 0, S1PTW = 0
[    1.141755]   FSC = 0x07: level 3 translation fault
[    1.143787] Data abort info:
[    1.144976]   ISV = 0, ISS = 0x00000007
[    1.146554]   CM = 0, WnR = 0
[    1.147817] swapper pgtable: 4k pages, 48-bit VAs, pgdp=00000000426f2000
[    1.150551] [fffffbfffda3b140] pgd=0000000042ffd003, p4d=0000000042ffd003, pud=0000000042ffe003, pmd=0000000042fff003, pte=0000000000000000
[    1.155728] Internal error: Oops: 96000007 [#1] SMP
[    1.157724] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G         C        5.15.0-rc3+ #100
[    1.161002] Hardware name: linux,dummy-virt (DT)
[    1.162939] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    1.165825] pc : alloc_init_pud+0x38c/0x550
[    1.167610] lr : alloc_init_pud+0x394/0x550
[    1.169358] sp : ffff80001001bd10
......
[    1.200527] Call trace:
[    1.201583]  alloc_init_pud+0x38c/0x550
[    1.203218]  __create_pgd_mapping+0x94/0xe0
[    1.204983]  update_mapping_prot+0x50/0xd8
[    1.206730]  mark_rodata_ro+0x50/0x58
[    1.208281]  kernel_init+0x3c/0x120
[    1.209760]  ret_from_fork+0x10/0x20
[    1.211298] Code: eb15003f 54000061 d5033a9f d5033fdf (f94000a1)
[    1.213856] ---[ end trace 59473413ffe3f52d ]---
[    1.215850] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b

[1] https://github.com/cloud-hypervisor/cloud-hypervisor

Suggested-by: Anshuman Khandual <[email protected]>
Signed-off-by: Jianyong Wu <[email protected]>
jongwu added a commit to jongwu/linux that referenced this pull request Oct 27, 2021
A race condition on page table updates can happen during kernel boot, as
both memory hotplug in kernel init and the following mark_rodata_ro can
update the page table. For virtio-mem, the function execution flow is:

-------------------------
kernel_init
  kernel_init_freeable
    ...
      do_initcall
        ...
          module_init [A]

  ...
  mark_readonly
    mark_rodata_ro [B]
-------------------------
virtio-mem can be initialized at [A] and spawn a workqueue to add
memory, so the page table update race can happen inside [B].

What's more, the race condition can happen even for ACPI-based memory
hotplug, as it can occur during the kernel boot period while the page
table is being updated inside mark_rodata_ro.

That's why the memory hotplug lock is needed to guard mark_rodata_ro and
avoid the race condition.

It may only happen on arm64, as fixmap, which is a global resource, is
used when creating page tables. So, the change is only for arm64.

The error often occurs inside alloc_init_pud() in arch/arm64/mm/mmu.c.

The race condition flow is:

*************** begin ************

kernel_init                                 virtio-mem workqueue
=========                                   ========
alloc_init_pud(...)
  pudp = pud_set_fixmap_offset(..)          alloc_init_pud(...)
...                                         ...
    READ_ONCE(*pudp) //OK!                    pudp = pud_set_fixmap_offset(
...                                         ...
  pud_clear_fixmap() //fixmap break
                                              READ_ONCE(*pudp) //CRASH!

**************** end *************
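
The interleaving in the diagram can be replayed as a small deterministic model in userspace C. This is only a sketch under stated assumptions: a single global pointer stands in for the shared fixmap slot, a `false` return stands in for the translation fault, and every name (`fixmap_slot`, `set_fixmap()`, `clear_fixmap()`, `read_via_fixmap()`, `replay_unlocked()`, `replay_serialized()`) is hypothetical rather than the arm64 code.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy model: one global slot plays the role of the shared fixmap entry.
 * All names here are hypothetical and only mirror the diagram.
 */
static const unsigned long *fixmap_slot;

static void set_fixmap(const unsigned long *table) { fixmap_slot = table; }
static void clear_fixmap(void) { fixmap_slot = NULL; }

/* Read through the slot; false stands in for the translation fault. */
static bool read_via_fixmap(unsigned long *out)
{
	if (!fixmap_slot)
		return false;
	*out = *fixmap_slot;
	return true;
}

/* Replays the unlocked interleaving; returns whether B's read succeeds. */
static bool replay_unlocked(void)
{
	static const unsigned long pud_a = 0xa, pud_b = 0xb;
	unsigned long val;

	set_fixmap(&pud_a);           /* A: pud_set_fixmap_offset() */
	if (!read_via_fixmap(&val))   /* A: READ_ONCE(*pudp), OK */
		return false;
	set_fixmap(&pud_b);           /* B: pud_set_fixmap_offset() */
	clear_fixmap();               /* A: pud_clear_fixmap() */
	return read_via_fixmap(&val); /* B: READ_ONCE(*pudp), fails */
}

/* Same steps, but path A finishes completely before path B starts. */
static bool replay_serialized(void)
{
	static const unsigned long pud_a = 0xa, pud_b = 0xb;
	unsigned long val;
	bool ok;

	set_fixmap(&pud_a);           /* A maps, reads and unmaps ... */
	ok = read_via_fixmap(&val);
	clear_fixmap();

	set_fixmap(&pud_b);           /* ... and only then does B run */
	ok = ok && read_via_fixmap(&val);
	clear_fixmap();
	return ok;
}
```

In this model `replay_unlocked()` fails on B's read while `replay_serialized()` succeeds, which is the effect a lock around both paths is meant to guarantee.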

I caught this error when testing virtio-mem (a new memory hotplug
driver) on arm64.

How to reproduce:
(1) prepare a kernel with virtio-mem enabled on arm64
(2) start a VM using Cloud Hypervisor using the kernel above
(3) hotplug memory, 20G in my case, with virtio-mem
(4) reboot or start a new kernel using kexec

Test several times and you may find the error below:

[    1.131039] Unable to handle kernel paging request at virtual address fffffbfffda3b140
[    1.134504] Mem abort info:
[    1.135722]   ESR = 0x96000007
[    1.136991]   EC = 0x25: DABT (current EL), IL = 32 bits
[    1.139189]   SET = 0, FnV = 0
[    1.140467]   EA = 0, S1PTW = 0
[    1.141755]   FSC = 0x07: level 3 translation fault
[    1.143787] Data abort info:
[    1.144976]   ISV = 0, ISS = 0x00000007
[    1.146554]   CM = 0, WnR = 0
[    1.147817] swapper pgtable: 4k pages, 48-bit VAs, pgdp=00000000426f2000
[    1.150551] [fffffbfffda3b140] pgd=0000000042ffd003, p4d=0000000042ffd003, pud=0000000042ffe003, pmd=0000000042fff003, pte=0000000000000000
[    1.155728] Internal error: Oops: 96000007 [#1] SMP
[    1.157724] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G         C        5.15.0-rc3+ #100
[    1.161002] Hardware name: linux,dummy-virt (DT)
[    1.162939] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    1.165825] pc : alloc_init_pud+0x38c/0x550
[    1.167610] lr : alloc_init_pud+0x394/0x550
[    1.169358] sp : ffff80001001bd10
......
[    1.200527] Call trace:
[    1.201583]  alloc_init_pud+0x38c/0x550
[    1.203218]  __create_pgd_mapping+0x94/0xe0
[    1.204983]  update_mapping_prot+0x50/0xd8
[    1.206730]  mark_rodata_ro+0x50/0x58
[    1.208281]  kernel_init+0x3c/0x120
[    1.209760]  ret_from_fork+0x10/0x20
[    1.211298] Code: eb15003f 54000061 d5033a9f d5033fdf (f94000a1)
[    1.213856] ---[ end trace 59473413ffe3f52d ]---
[    1.215850] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b

We can see that the error comes from the level 3 translation, as the pte
value is *0*. That is because the fixmap had already been cleared when it
was accessed.

Signed-off-by: Jianyong Wu <[email protected]>
fengguang pushed a commit to 0day-ci/linux that referenced this pull request Oct 27, 2021
A race condition on page table updates can happen during kernel boot, as
both the memory hotplug action during kernel init and mark_rodata_ro can
update the page table. For virtio-mem, the function execution flow is:

-------------------------
kernel_init
  kernel_init_freeable
    ...
      do_initcall
        ...
          module_init [A]

  ...
  mark_readonly
    mark_rodata_ro [B]
-------------------------
virtio-mem can be initialized at [A] and spawn a workqueue to add
memory, so the page table update race can happen inside [B].

What's more, the race condition can happen even for ACPI-based memory
hotplug, as it can occur during the kernel boot period while the page
table is being updated inside mark_rodata_ro.

That's why the memory hotplug lock is needed to guard mark_rodata_ro and
avoid the race condition.

It may only happen on arm64, as fixmap, which is a global resource, is
used when creating page tables. So, the change is only for arm64.

The error often occurs inside alloc_init_pud() in arch/arm64/mm/mmu.c.

The race condition flow is:

*************** begin ************

kernel_init                                 virtio-mem workqueue
=========                                   ========
alloc_init_pud(...)
  pudp = pud_set_fixmap_offset(..)          alloc_init_pud(...)
...                                         ...
    READ_ONCE(*pudp) //OK!                    pudp = pud_set_fixmap_offset(
...                                         ...
  pud_clear_fixmap() //fixmap break
                                              READ_ONCE(*pudp) //CRASH!

**************** end *************

I caught this error when testing virtio-mem (a new memory hotplug
driver) on arm64.

How to reproduce:
(1) prepare a kernel with virtio-mem enabled on arm64
(2) start a VM using Cloud Hypervisor using the kernel above
(3) hotplug memory, 20G in my case, with virtio-mem
(4) reboot or start a new kernel using kexec

Test several times and you may find the error below:

[    1.131039] Unable to handle kernel paging request at virtual address fffffbfffda3b140
[    1.134504] Mem abort info:
[    1.135722]   ESR = 0x96000007
[    1.136991]   EC = 0x25: DABT (current EL), IL = 32 bits
[    1.139189]   SET = 0, FnV = 0
[    1.140467]   EA = 0, S1PTW = 0
[    1.141755]   FSC = 0x07: level 3 translation fault
[    1.143787] Data abort info:
[    1.144976]   ISV = 0, ISS = 0x00000007
[    1.146554]   CM = 0, WnR = 0
[    1.147817] swapper pgtable: 4k pages, 48-bit VAs, pgdp=00000000426f2000
[    1.150551] [fffffbfffda3b140] pgd=0000000042ffd003, p4d=0000000042ffd003, pud=0000000042ffe003, pmd=0000000042fff003, pte=0000000000000000
[    1.155728] Internal error: Oops: 96000007 [#1] SMP
[    1.157724] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G         C        5.15.0-rc3+ #100
[    1.161002] Hardware name: linux,dummy-virt (DT)
[    1.162939] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    1.165825] pc : alloc_init_pud+0x38c/0x550
[    1.167610] lr : alloc_init_pud+0x394/0x550
[    1.169358] sp : ffff80001001bd10
......
[    1.200527] Call trace:
[    1.201583]  alloc_init_pud+0x38c/0x550
[    1.203218]  __create_pgd_mapping+0x94/0xe0
[    1.204983]  update_mapping_prot+0x50/0xd8
[    1.206730]  mark_rodata_ro+0x50/0x58
[    1.208281]  kernel_init+0x3c/0x120
[    1.209760]  ret_from_fork+0x10/0x20
[    1.211298] Code: eb15003f 54000061 d5033a9f d5033fdf (f94000a1)
[    1.213856] ---[ end trace 59473413ffe3f52d ]---
[    1.215850] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b

We can see that the error comes from the level 3 translation, as the pte
value is *0*. That is because the fixmap had already been cleared when it
was accessed.

Signed-off-by: Jianyong Wu <[email protected]>
intersectRaven pushed a commit to intersectRaven/linux that referenced this pull request Oct 27, 2021
[ Upstream commit 787252a ]

With PREEMPT_COUNT=y, when a CPU is offlined and then onlined again, we
get:

BUG: scheduling while atomic: swapper/1/0/0x00000000
no locks held by swapper/1/0.
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.15.0-rc2+ #100
Call Trace:
 dump_stack_lvl+0xac/0x108
 __schedule_bug+0xac/0xe0
 __schedule+0xcf8/0x10d0
 schedule_idle+0x3c/0x70
 do_idle+0x2d8/0x4a0
 cpu_startup_entry+0x38/0x40
 start_secondary+0x2ec/0x3a0
 start_secondary_prolog+0x10/0x14

This is because powerpc's arch_cpu_idle_dead() decrements the idle task's
preempt count, for reasons explained in commit a7c2bb8 ("powerpc:
Re-enable preemption before cpu_die()"), specifically "start_secondary()
expects a preempt_count() of 0."

However, since commit 2c669ef ("powerpc/preempt: Don't touch the idle
task's preempt_count during hotplug") and commit f1a0a37 ("sched/core:
Initialize the idle task with preemption disabled"), that justification no
longer holds.

The idle task isn't supposed to re-enable preemption, so remove the
vestigial preempt_enable() from the CPU offline path.

Tested with pseries and powernv in qemu, and pseries on PowerVM.

Fixes: 2c669ef ("powerpc/preempt: Don't touch the idle task's preempt_count during hotplug")
Signed-off-by: Nathan Lynch <[email protected]>
Reviewed-by: Valentin Schneider <[email protected]>
Reviewed-by: Srikar Dronamraju <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sasha Levin <[email protected]>
joe-lawrence added a commit to joe-lawrence/linux that referenced this pull request Oct 28, 2021
Fix the following checkpatch complaints:

  ERROR: code indent should use tabs where possible
  #96: FILE: lib/livepatch/test_klp_convert1.c:43:
  +        return 0;$

  WARNING: please, no spaces at the start of a line
  #96: FILE: lib/livepatch/test_klp_convert1.c:43:
  +        return 0;$

  ERROR: code indent should use tabs where possible
  #99: FILE: lib/livepatch/test_klp_convert1.c:46:
  +        .set = print_debug_set,$

  WARNING: please, no spaces at the start of a line
  #99: FILE: lib/livepatch/test_klp_convert1.c:46:
  +        .set = print_debug_set,$

  ERROR: code indent should use tabs where possible
  #100: FILE: lib/livepatch/test_klp_convert1.c:47:
  +        .get = param_get_int,$

  WARNING: please, no spaces at the start of a line
  #100: FILE: lib/livepatch/test_klp_convert1.c:47:
  +        .get = param_get_int,$

  ERROR: code indent should use tabs where possible
  #221: FILE: lib/livepatch/test_klp_convert2.c:43:
  +        return 0;$

  WARNING: please, no spaces at the start of a line
  #221: FILE: lib/livepatch/test_klp_convert2.c:43:
  +        return 0;$

  ERROR: code indent should use tabs where possible
  #224: FILE: lib/livepatch/test_klp_convert2.c:46:
  +        .set = print_debug_set,$

  WARNING: please, no spaces at the start of a line
  #224: FILE: lib/livepatch/test_klp_convert2.c:46:
  +        .set = print_debug_set,$

  ERROR: code indent should use tabs where possible
  #225: FILE: lib/livepatch/test_klp_convert2.c:47:
  +        .get = param_get_int,$

  WARNING: please, no spaces at the start of a line
  #225: FILE: lib/livepatch/test_klp_convert2.c:47:
  +        .get = param_get_int,$

Signed-off-by: Joe Lawrence <[email protected]>
codelabs-bot pushed a commit to codelabs-ch/linux that referenced this pull request Oct 29, 2021
[ Upstream commit 787252a ]

With PREEMPT_COUNT=y, when a CPU is offlined and then onlined again, we
get:

BUG: scheduling while atomic: swapper/1/0/0x00000000
no locks held by swapper/1/0.
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.15.0-rc2+ #100
Call Trace:
 dump_stack_lvl+0xac/0x108
 __schedule_bug+0xac/0xe0
 __schedule+0xcf8/0x10d0
 schedule_idle+0x3c/0x70
 do_idle+0x2d8/0x4a0
 cpu_startup_entry+0x38/0x40
 start_secondary+0x2ec/0x3a0
 start_secondary_prolog+0x10/0x14

This is because powerpc's arch_cpu_idle_dead() decrements the idle task's
preempt count, for reasons explained in commit a7c2bb8 ("powerpc:
Re-enable preemption before cpu_die()"), specifically "start_secondary()
expects a preempt_count() of 0."

However, since commit 2c669ef ("powerpc/preempt: Don't touch the idle
task's preempt_count during hotplug") and commit f1a0a37 ("sched/core:
Initialize the idle task with preemption disabled"), that justification no
longer holds.

The idle task isn't supposed to re-enable preemption, so remove the
vestigial preempt_enable() from the CPU offline path.

Tested with pseries and powernv in qemu, and pseries on PowerVM.

Fixes: 2c669ef ("powerpc/preempt: Don't touch the idle task's preempt_count during hotplug")
Signed-off-by: Nathan Lynch <[email protected]>
Reviewed-by: Valentin Schneider <[email protected]>
Reviewed-by: Srikar Dronamraju <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sasha Levin <[email protected]>
chleroy pushed a commit to chleroy/linux that referenced this pull request Dec 14, 2021
Fix the following checkpatch complaints:

  ERROR: code indent should use tabs where possible
  #96: FILE: lib/livepatch/test_klp_convert1.c:43:
  +        return 0;$

  WARNING: please, no spaces at the start of a line
  #96: FILE: lib/livepatch/test_klp_convert1.c:43:
  +        return 0;$

  ERROR: code indent should use tabs where possible
  #99: FILE: lib/livepatch/test_klp_convert1.c:46:
  +        .set = print_debug_set,$

  WARNING: please, no spaces at the start of a line
  #99: FILE: lib/livepatch/test_klp_convert1.c:46:
  +        .set = print_debug_set,$

  ERROR: code indent should use tabs where possible
  #100: FILE: lib/livepatch/test_klp_convert1.c:47:
  +        .get = param_get_int,$

  WARNING: please, no spaces at the start of a line
  #100: FILE: lib/livepatch/test_klp_convert1.c:47:
  +        .get = param_get_int,$

  ERROR: code indent should use tabs where possible
  #221: FILE: lib/livepatch/test_klp_convert2.c:43:
  +        return 0;$

  WARNING: please, no spaces at the start of a line
  #221: FILE: lib/livepatch/test_klp_convert2.c:43:
  +        return 0;$

  ERROR: code indent should use tabs where possible
  #224: FILE: lib/livepatch/test_klp_convert2.c:46:
  +        .set = print_debug_set,$

  WARNING: please, no spaces at the start of a line
  #224: FILE: lib/livepatch/test_klp_convert2.c:46:
  +        .set = print_debug_set,$

  ERROR: code indent should use tabs where possible
  #225: FILE: lib/livepatch/test_klp_convert2.c:47:
  +        .get = param_get_int,$

  WARNING: please, no spaces at the start of a line
  #225: FILE: lib/livepatch/test_klp_convert2.c:47:
  +        .get = param_get_int,$

Signed-off-by: Joe Lawrence <[email protected]>
lluchs pushed a commit to KIT-OSGroup/linux that referenced this pull request May 10, 2022
Fix crash consistency issue with alternate logs (#100)
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Oct 25, 2022
Some use-cases and/or data patterns may benefit from larger zspages.
Currently the limit on the number of physical pages that are linked into a
zspage is hardcoded to 4.  A higher limit changes key characteristics of a
number of the size classes, improving compactness of the pool and reducing
the amount of memory the zsmalloc pool uses.

For instance, the huge size class watermark is currently set to 3264
bytes.  With order 3 zspages we have more normal classes and the huge size
watermark becomes 3632.  With order 4 zspages the huge size watermark
becomes 3840.

Commit #1 has more numbers and some analysis.
	

This patch (of 6):

zsmalloc has 255 size classes.  Size classes contain a number of zspages,
which store objects of the same size.  zspage can consist of up to four
physical pages.  The exact (most optimal) zspage size is calculated for
each size class during zsmalloc pool creation.

As a reasonable optimization, zsmalloc merges size classes that have
similar characteristics: number of pages per zspage and number of objects
zspage can store.

For example, let's look at the following size classes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
..
   94  1536           0            0             0          0          0                3        0
  100  1632           0            0             0          0          0                2        0
..

Size classes #95-99 are merged with size class #100.  That is, each time
we store an object of size, say, 1568 bytes, instead of using class #96 we
end up storing it in size class #100.  Class #100 is for objects of 1632
bytes in size, hence every 1568-byte object wastes 1632-1568 = 64 bytes.
Class #100 zspages consist of 2 physical pages and can hold 5 objects.
When we need to store, say, 13 objects of size 1568 we end up allocating
three zspages; in other words, 6 physical pages.
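
The page count in the paragraph above follows from two integer divisions. The helpers below are a minimal sketch for checking that arithmetic, assuming a 4096-byte page; `objs_per_zspage()` and `pages_needed()` are hypothetical names, not zsmalloc functions.

```c
#define PAGE_SIZE 4096 /* assumed page size; consistent with the figures in the text */

/* Objects of class_size bytes that fit in a zspage of the given page count. */
static int objs_per_zspage(int pages_per_zspage, int class_size)
{
	return pages_per_zspage * PAGE_SIZE / class_size;
}

/* Physical pages needed to store nr_objs objects in such zspages. */
static int pages_needed(int nr_objs, int pages_per_zspage, int class_size)
{
	int per_zspage = objs_per_zspage(pages_per_zspage, class_size);
	int zspages = (nr_objs + per_zspage - 1) / per_zspage; /* round up */

	return zspages * pages_per_zspage;
}
```

With the numbers from the text, `objs_per_zspage(2, 1632)` evaluates to 5 and `pages_needed(13, 2, 1632)` evaluates to 6.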

However, if we look closer at size class #96 (which should hold objects
of size 1568 bytes) and trace get_pages_per_zspage():

    pages per zspage      wasted bytes     used%
           1                  960           76
           2                  352           95
           3                 1312           89
           4                  704           95
           5                   96           99

We'd notice that the most optimal zspage configuration for this class is
when it consists of 5 physical pages, but currently we never let zspages
consist of more than 4 pages.  A 5 page class #96 configuration would
store 13 objects of size 1568 in a single zspage, allocating 5 physical
pages, as opposed to 6 physical pages that class #100 will allocate.
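
The wasted-bytes and used% columns in the trace can be reproduced with a short calculation. The sketch below assumes a 4096-byte page and picks the smallest page count with the highest used%; the helper names are hypothetical, and the kernel's own get_pages_per_zspage() may differ in detail.

```c
#define PAGE_SIZE 4096 /* assumed page size; consistent with the traced numbers */

/* Bytes left over at the end of a zspage made of `pages` physical pages. */
static int wasted_bytes(int pages, int class_size)
{
	return (pages * PAGE_SIZE) % class_size;
}

/* Percentage of the zspage occupied by objects, rounded down. */
static int used_percent(int pages, int class_size)
{
	int zspage_size = pages * PAGE_SIZE;

	return (zspage_size - wasted_bytes(pages, class_size)) * 100 / zspage_size;
}

/* Smallest page count with the highest used%, trying 1..max_pages. */
static int best_pages_per_zspage(int class_size, int max_pages)
{
	int best = 1, best_used = 0;

	for (int pages = 1; pages <= max_pages; pages++) {
		int used = used_percent(pages, class_size);

		if (used > best_used) {
			best_used = used;
			best = pages;
		}
	}
	return best;
}
```

For 1568-byte objects, `best_pages_per_zspage(1568, 4)` evaluates to 2, while raising the limit to 5 pages makes `best_pages_per_zspage(1568, 5)` evaluate to 5.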

A higher order zspage for class #96 also changes its key characteristics:
pages per-zspage and objects per-zspage.  As a result classes #96 and #100
are not merged anymore, which gives us more compact zsmalloc.
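
The merge condition described in this message (same pages per zspage and same objects per zspage) can be written as a small predicate. This is a sketch under stated assumptions: a 4096-byte page, and a hypothetical `struct size_class_cfg` and `can_merge()` that only illustrate the rule and are not the zsmalloc implementation.

```c
#include <stdbool.h>

#define PAGE_SIZE 4096 /* assumed page size */

/* Hypothetical description of a size class, for illustration only. */
struct size_class_cfg {
	int size;             /* object size in bytes */
	int pages_per_zspage; /* physical pages per zspage */
};

/* Objects that fit in one zspage of this class. */
static int class_objs_per_zspage(const struct size_class_cfg *c)
{
	return c->pages_per_zspage * PAGE_SIZE / c->size;
}

/* Two classes are merge candidates when both key characteristics match. */
static bool can_merge(const struct size_class_cfg *a,
		      const struct size_class_cfg *b)
{
	return a->pages_per_zspage == b->pages_per_zspage &&
	       class_objs_per_zspage(a) == class_objs_per_zspage(b);
}
```

With 2-page zspages, the 1568-byte and 1632-byte classes both hold 5 objects and so satisfy `can_merge()`; with a 5-page zspage the 1568-byte class holds 13 objects and the predicate no longer holds.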

Let's take a closer look at the bottom of
/sys/kernel/debug/zsmalloc/zram0/classes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  254  4096           0            0             0          0          0                1        0
...

For exactly the same reason - maximum 4 pages per zspage - the last non-huge
size class is #202, which stores objects of size 3264 bytes.  Any object
larger than 3264 bytes, hence, is considered to be huge and lands in size
class #254, which uses a whole physical page to store every object.  To
put it slightly differently - objects in huge classes don't share physical
pages.

3264 bytes is too low of a watermark and we have too many huge classes:
classes from #203 to #254.  Similarly to size class #96 above, higher
order zspages change key characteristics for some of those huge size
classes and thus those classes become normal classes, where stored objects
share physical pages.

We move huge class watermark with higher order zspages.

For order 3, huge class watermark becomes 3632 bytes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  211  3408           0            0             0          0          0                5        0
  217  3504           0            0             0          0          0                6        0
  222  3584           0            0             0          0          0                7        0
  225  3632           0            0             0          0          0                8        0
  254  4096           0            0             0          0          0                1        0
...

For order 4, huge class watermark becomes 3840 bytes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  206  3328           0            0             0          0          0               13        0
  207  3344           0            0             0          0          0                9        0
  208  3360           0            0             0          0          0               14        0
  211  3408           0            0             0          0          0                5        0
  212  3424           0            0             0          0          0               16        0
  214  3456           0            0             0          0          0               11        0
  217  3504           0            0             0          0          0                6        0
  219  3536           0            0             0          0          0               13        0
  222  3584           0            0             0          0          0                7        0
  223  3600           0            0             0          0          0               15        0
  225  3632           0            0             0          0          0                8        0
  228  3680           0            0             0          0          0                9        0
  230  3712           0            0             0          0          0               10        0
  232  3744           0            0             0          0          0               11        0
  234  3776           0            0             0          0          0               12        0
  235  3792           0            0             0          0          0               13        0
  236  3808           0            0             0          0          0               14        0
  238  3840           0            0             0          0          0               15        0
  254  4096           0            0             0          0          0                1        0
...

TESTS
=====

1) ChromeOS memory pressure test
-----------------------------------------------------------------------------

Our standard memory pressure test, which is designed with reproducibility
in mind.

zram is configured as a swap device with the lzo-rle compression algorithm.
We captured /sys/block/zram0/mm_stat after every test and rebooted the
device.

Columns, per Documentation/admin-guide/blockdev/zram.rst:

orig_data_size        mem_used_total      mem_used_max         pages_compacted
          compr_data_size         mem_limit           same_pages          huge_pages
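
The two-row header above interleaves the column names, so a minimal sketch of
how one mm_stat line maps onto them may help. The field order follows
zram.rst; the helper name `parse_mm_stat` and the derived `ratio` are
illustrative assumptions, and the trailing `huge_pages_since` column is only
present on kernels that report it.

```python
# Field order as documented in Documentation/admin-guide/blockdev/zram.rst.
MM_STAT_FIELDS = (
    "orig_data_size",
    "compr_data_size",
    "mem_used_total",
    "mem_limit",
    "mem_used_max",
    "same_pages",
    "pages_compacted",
    "huge_pages",
    "huge_pages_since",  # only on kernels that report a 9th column
)


def parse_mm_stat(line):
    """Map one whitespace-separated mm_stat line onto its column names.

    Hypothetical helper: zip() stops at the shorter sequence, so an
    8-column line simply has no 'huge_pages_since' key.
    """
    values = [int(v) for v in line.split()]
    return dict(zip(MM_STAT_FIELDS, values))


# One sample line in the same format as the captures in this report.
sample = "10353639424 2981711944 3166896128        0 3543158784   579494   825135   123707"
stat = parse_mm_stat(sample)

# Derived value (not part of mm_stat): how well the stored data compressed.
ratio = stat["orig_data_size"] / stat["compr_data_size"]
print(stat["mem_used_total"], stat["huge_pages"], round(ratio, 2))
```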

ORDER 2 (BASE)

10353639424 2981711944 3166896128        0 3543158784   579494   825135   123707
10168573952 2932288347 3106541568        0 3499085824   565187   853137   126153
9950461952 2815911234 3035693056        0 3441090560   586696   748054   122103
9892335616 2779566152 2943459328        0 3514736640   591541   650696   119621
9993949184 2814279212 3021357056        0 3336421376   582488   711744   121273
9953226752 2856382009 3025649664        0 3512893440   564559   787861   123034
9838448640 2785481728 2997575680        0 3367219200   573282   777099   122739

ORDER 3

9509138432 2706941227 2823393280        0 3389587456   535856  1011472    90223
10105245696 2882368370 3013095424        0 3296165888   563896  1059033    94808
9531236352 2666125512 2867650560        0 3396173824   567117  1126396    88807
9561812992 2714536764 2956652544        0 3310505984   548223   827322    90992
9807470592 2790315707 2908053504        0 3378315264   563670  1020933    93725
10178371584 2948838782 3071209472        0 3329548288   548533   954546    90730
9925165056 2849839413 2958274560        0 3336978432   551464  1058302    89381

ORDER 4

9444515840 2613362645 2668232704        0 3396759552   573735  1162207    83475
10129108992 2925888488 3038351360        0 3499597824   555634  1231542    84525
9876594688 2786692282 2897006592        0 3469463552   584835  1290535    84133
10012909568 2649711847 2801512448        0 3171323904   675405   750728    80424
10120966144 2866742402 2978639872        0 3257815040   587435  1093981    83587
9578790912 2671245225 2802270208        0 3376353280   545548  1047930    80895
10108588032 2888433523 2983960576        0 3316641792   571445  1290640    81402

First, we establish that order 3 and 4 don't cause any statistically
significant change in `orig_data_size` (number of bytes we store during
the test); in other words, larger zspages don't cause regressions.

T-test for order 3:

x order-2-stored
+ order-3-stored
+-----------------------------------------------------------------------------+
|+ +  +                     +  x   x  +  x   x         +    x+               x|
| |________________________AM__|_________M_____A____|__________|              |
+-----------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   7 9.8384486e+09 1.0353639e+10 9.9532268e+09 1.0021519e+10 1.7916718e+08
+   7 9.5091384e+09 1.0178372e+10 9.8074706e+09 9.8026344e+09 2.7856206e+08
No difference proven at 95.0% confidence

T-test for order 4:

x order-2-stored
+ order-4-stored
+-----------------------------------------------------------------------------+
|                                                         +                   |
|+          +                     x  +x    xx  x +       ++   x              x|
|              |__________________|____A____M____M____________|_|             |
+-----------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   7 9.8384486e+09 1.0353639e+10 9.9532268e+09 1.0021519e+10 1.7916718e+08
+   7 9.4445158e+09 1.0129109e+10  1.001291e+10 9.8959249e+09 2.7947784e+08
No difference proven at 95.0% confidence

Next we establish that there is a statistically significant improvement
in `mem_used_total` metrics.

T-test for order 3:

x order-2-usedmem
+ order-3-usedmem
+-----------------------------------------------------------------------------+
|+         +        +       x ++        x  + xx x       +       x            x|
|        |_________________A__M__|____________|__A________________|           |
+-----------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   7 2.9434593e+09 3.1668961e+09 3.0256497e+09 3.0424532e+09      73235062
+   7 2.8233933e+09 3.0712095e+09 2.9566525e+09 2.9426185e+09      84630851
Difference at 95.0% confidence
	-9.98347e+07 +/- 9.21744e+07
	-3.28139% +/- 3.02961%
	(Student's t, pooled s = 7.91383e+07)
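
As a rough cross-check, the interval above can be recomputed by hand. The
sketch below assumes a pooled-variance Student's t interval, which is what
the output reports, and takes the two samples from the `mem_used_total`
column (third column) of the ORDER 2 and ORDER 3 captures. The function name
and the hardcoded critical value (two-sided 95%, 12 degrees of freedom) are
assumptions of this sketch, not part of the original tooling.

```python
import math
import statistics

# mem_used_total (third mm_stat column) from the captures above.
order2 = [3166896128, 3106541568, 3035693056, 2943459328,
          3021357056, 3025649664, 2997575680]
order3 = [2823393280, 3013095424, 2867650560, 2956652544,
          2908053504, 3071209472, 2958274560]

# Two-sided 95% quantile of Student's t for 7 + 7 - 2 = 12 degrees of freedom.
T_CRIT_95_DF12 = 2.179


def pooled_t_interval(base, new, t_crit):
    """Return (difference of means, half-width of the interval, pooled s)."""
    n1, n2 = len(base), len(new)
    diff = statistics.mean(new) - statistics.mean(base)
    pooled_var = ((n1 - 1) * statistics.variance(base)
                  + (n2 - 1) * statistics.variance(new)) / (n1 + n2 - 2)
    pooled_s = math.sqrt(pooled_var)
    half_width = t_crit * pooled_s * math.sqrt(1 / n1 + 1 / n2)
    return diff, half_width, pooled_s


diff, half_width, pooled_s = pooled_t_interval(order2, order3, T_CRIT_95_DF12)

# The difference counts as significant when the interval excludes zero.
significant = abs(diff) > half_width
print(f"{diff:.6g} +/- {half_width:.6g} (pooled s = {pooled_s:.6g})")
```

The interval excludes zero only narrowly here, which matches the large
relative error bar (about 3.3% +/- 3.0%) reported above.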

T-test for order 4:

x order-2-usedmem
+ order-4-usedmem
+-----------------------------------------------------------------------------+
|                    +                                 x                      |
|+                   +              +      x    ++ x   x *          x        x|
|             |__________________A__M__________|_____|_M__A__________|        |
+-----------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   7 2.9434593e+09 3.1668961e+09 3.0256497e+09 3.0424532e+09      73235062
+   7 2.6682327e+09 3.0383514e+09 2.8970066e+09 2.8814248e+09 1.3098053e+08
Difference at 95.0% confidence
	-1.61028e+08 +/- 1.23591e+08
	-5.29272% +/- 4.0622%
	(Student's t, pooled s = 1.06111e+08)

Order 3 zspages also show statistically significant improvement in
`mem_used_max` metrics.

T-test for order 3:

x order-2-maxmem
+ order-3-maxmem
+-----------------------------------------------------------------------------+
|+   +     + x+        x  +   + +             x                x    x        x|
|    |________M__A_________|_|_____________________A___________M____________| |
+-----------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   7 3.3364214e+09 3.5431588e+09 3.4990858e+09 3.4592294e+09      80073158
+   7 3.2961659e+09 3.3961738e+09 3.3369784e+09 3.3481822e+09      39840377
Difference at 95.0% confidence
	-1.11047e+08 +/- 7.36589e+07
	-3.21017% +/- 2.12934%
	(Student's t, pooled s = 6.32415e+07)

Order 4 zspages, on the other hand, do not show any statistically significant
improvement in `mem_used_max` metrics.

T-test for order 4:

x order-2-maxmem
+ order-4-maxmem
+-----------------------------------------------------------------------------+
|+                 +           +   x     x +   +        x     +     *  x     x|
|              |_______________________A___M________________A_|_____M_______| |
+-----------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   7 3.3364214e+09 3.5431588e+09 3.4990858e+09 3.4592294e+09      80073158
+   7 3.1713239e+09 3.4995978e+09 3.3763533e+09 3.3554221e+09 1.1609062e+08
No difference proven at 95.0% confidence

Overall, with a sufficient level of confidence, order 3 zspages appear to
be beneficial for this particular use case and these data patterns.

As expected, we also observed fewer huge pages (`huge_pages`) when zsmalloc
is configured with order 3 and order 4 zspages, for the reason already
explained.

2) Synthetic test
-----------------------------------------------------------------------------

The test untars linux-6.0.tar.xz and compiles the kernel.

zram is configured as a block device with an ext4 file system and the
lzo-rle compression algorithm. We captured /sys/block/zram0/mm_stat after
every test and rebooted the VM.

orig_data_size        mem_used_total      mem_used_max         pages_compacted
          compr_data_size         mem_limit           same_pages          huge_pages

ORDER 2 (BASE)

1691807744 628091753 655187968        0 655187968       59        0    34042    34043
1691803648 628089105 655159296        0 655159296       60        0    34043    34043
1691795456 628087429 655151104        0 655151104       59        0    34046    34046
1691799552 628093723 655216640        0 655216640       60        0    34044    34044

ORDER 3

1691787264 627781464 641740800        0 641740800       59        0    33591    33591
1691795456 627794239 641789952        0 641789952       59        0    33591    33591
1691811840 627788466 641691648        0 641691648       60        0    33591    33591
1691791360 627790682 641781760        0 641781760       59        0    33591    33591

ORDER 4

1691807744 627729506 639627264        0 639627264       59        0    33432    33432
1691820032 627731485 639606784        0 639606784       59        0    33432    33432
1691799552 627725753 639623168        0 639623168       59        0    33432    33433
1691820032 627734080 639746048        0 639746048       61        0    33432    33432

Order 3 and order 4 show statistically significant improvement in
`mem_used_total` metrics.

T-test for order 3:

x order-2-usedmem-comp
+ order-3-usedmem-comp
+-----------------------------------------------------------------------------+
|++                                                                          x|
|++                                                                          x|
|AM                                                                          A|
+-----------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   4  6.551511e+08 6.5521664e+08 6.5518797e+08 6.5517875e+08     29795.878
+   4 6.4169165e+08 6.4178995e+08 6.4178176e+08 6.4175104e+08         45056
Difference at 95.0% confidence
	-1.34277e+07 +/- 66089.8
	-2.04947% +/- 0.0100873%
	(Student's t, pooled s = 38195.8)

T-test for order 4:

x order-2-usedmem-comp
+ order-4-usedmem-comp
+-----------------------------------------------------------------------------+
|+                                                                           x|
|+                                                                           x|
|++                                                                          x|
|A|                                                                          A|
+-----------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   4  6.551511e+08 6.5521664e+08 6.5518797e+08 6.5517875e+08     29795.878
+   4 6.3960678e+08 6.3974605e+08 6.3962726e+08 6.3965082e+08     64101.637
Difference at 95.0% confidence
	-1.55279e+07 +/- 86486.9
	-2.37003% +/- 0.0132005%
	(Student's t, pooled s = 49984.1)

Order 3 and order 4 show statistically significant improvement in
`mem_used_max` metrics.

T-test for order 3:

x order-2-maxmem-comp
+ order-3-maxmem-comp
+-----------------------------------------------------------------------------+
|++                                                                          x|
|++                                                                          x|
|AM                                                                          A|
+-----------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   4  6.551511e+08 6.5521664e+08 6.5518797e+08 6.5517875e+08     29795.878
+   4 6.4169165e+08 6.4178995e+08 6.4178176e+08 6.4175104e+08         45056
Difference at 95.0% confidence
	-1.34277e+07 +/- 66089.8
	-2.04947% +/- 0.0100873%
	(Student's t, pooled s = 38195.8)

T-test for order 4:

x order-2-maxmem-comp
+ order-4-maxmem-comp
+-----------------------------------------------------------------------------+
|+                                                                           x|
|+                                                                           x|
|++                                                                          x|
|A|                                                                          A|
+-----------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   4  6.551511e+08 6.5521664e+08 6.5518797e+08 6.5517875e+08     29795.878
+   4 6.3960678e+08 6.3974605e+08 6.3962726e+08 6.3965082e+08     64101.637
Difference at 95.0% confidence
	-1.55279e+07 +/- 86486.9
	-2.37003% +/- 0.0132005%
	(Student's t, pooled s = 49984.1)

This test tends to benefit more from order 4 zspages, due to the test's
data patterns.

Data patterns that generate a considerable number of badly compressible
objects benefit from a higher `huge_class_size` watermark, which is achieved
with order 4 zspages.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sergey Senozhatsky <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Nitin Gupta <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Oct 26, 2022
zsmalloc has 255 size classes. Size classes contain a number of zspages,
which store objects of the same size. A zspage can consist of up to four
physical pages. The optimal zspage size is calculated for each size class
during zsmalloc pool creation.

As a reasonable optimization, zsmalloc merges size classes that have
similar characteristics: number of pages per zspage and number of
objects a zspage can store.

For example, let's look at the following size classes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
..
   94  1536           0            0             0          0          0                3        0
  100  1632           0            0             0          0          0                2        0
..

Size classes #95-99 are merged with size class #100. That is, each time
we store an object of size, say, 1568 bytes, instead of using class #96
we end up storing it in size class #100. Class #100 is for objects of
1632 bytes in size, hence every 1568-byte object wastes 1632-1568 = 64
bytes. Class #100 zspages consist of 2 physical pages and can hold 5
objects. When we need to store, say, 13 objects of size 1568 we end up
allocating three zspages; in other words, 6 physical pages.
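
The arithmetic in the paragraph above can be spelled out in a few lines. This
is only a worked example, assuming 4 KiB pages; the variable names are made
up for illustration.

```python
import math

PAGE_SIZE = 4096        # assumption: 4 KiB physical pages
CLASS_SIZE = 1632       # size class #100
OBJ_SIZE = 1568         # object size that would belong to class #96
PAGES_PER_ZSPAGE = 2    # class #100, from the table above
N_OBJECTS = 13

# How many class #100 slots fit into one 2-page zspage.
objs_per_zspage = PAGES_PER_ZSPAGE * PAGE_SIZE // CLASS_SIZE

# zspages (and physical pages) needed to hold all 13 objects.
zspages_needed = math.ceil(N_OBJECTS / objs_per_zspage)
pages_needed = zspages_needed * PAGES_PER_ZSPAGE

# Bytes lost per object because it sits in a larger slot than it needs.
padding_per_object = CLASS_SIZE - OBJ_SIZE

print(objs_per_zspage, zspages_needed, pages_needed, padding_per_object)
```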

However, if we look closer at size class #96 (which should hold objects
of size 1568 bytes) and trace get_pages_per_zspage():

    pages per zspage      wasted bytes     used%
           1                  960           76
           2                  352           95
           3                 1312           89
           4                  704           95
           5                   96           99

We'd notice that the optimal zspage configuration for this class is
when it consists of 5 physical pages, but currently we never let zspages
consist of more than 4 pages. A 5 page class #96 configuration would
store 13 objects of size 1568 in a single zspage, allocating 5 physical
pages, as opposed to 6 physical pages that class #100 will allocate.
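
The trace above can be reproduced with a short sketch. It mirrors the
selection rule as described here (take the smallest page count that gives
the highest used percentage), assuming 4 KiB pages; the function names are
illustrative and this is not the kernel's code.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB physical pages


def trace_zspage_configs(class_size, max_pages):
    """Return (pages, wasted_bytes, used_percent) per candidate zspage size."""
    rows = []
    for pages in range(1, max_pages + 1):
        zspage_size = pages * PAGE_SIZE
        waste = zspage_size % class_size
        used_pc = (zspage_size - waste) * 100 // zspage_size  # integer percent
        rows.append((pages, waste, used_pc))
    return rows


def pages_per_zspage(class_size, max_pages):
    """Pick the first page count with the strictly highest used percentage."""
    best_pages, best_used = 1, 0
    for pages, _waste, used_pc in trace_zspage_configs(class_size, max_pages):
        if used_pc > best_used:
            best_pages, best_used = pages, used_pc
    return best_pages


for pages, waste, used_pc in trace_zspage_configs(1568, 5):
    print(f"{pages:>4} {waste:>6} {used_pc:>4}")

# With a 4-page limit the 2-page layout wins; allowing 5 pages selects 5.
print(pages_per_zspage(1568, 4), pages_per_zspage(1568, 5))
```

With 5 pages the zspage holds 5 * 4096 // 1568 = 13 objects, which is where
the "13 objects in 5 physical pages" figure comes from.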

A higher order zspage for class #96 also changes its key characteristics:
pages per zspage and objects per zspage. As a result classes #96 and #100
are not merged anymore, which gives us a more compact zsmalloc.
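
A simplified model of the merging rule shows why the page limit matters. It
assumes 4 KiB pages, a 32-byte minimum class size and a 16-byte step between
the 255 classes (consistent with the class/size tables here), and merges a
class into its larger neighbour whenever both have the same pages per zspage
and objects per zspage. All names are hypothetical; this is a sketch, not
the kernel implementation.

```python
PAGE_SIZE = 4096          # assumption: 4 KiB pages
MIN_ALLOC_SIZE = 32       # assumption: size of class #0
CLASS_DELTA = 16          # assumption: size step between classes
NUM_CLASSES = 255


def class_size(index):
    return min(MIN_ALLOC_SIZE + index * CLASS_DELTA, PAGE_SIZE)


def zspage_shape(size, max_pages):
    """Return (pages_per_zspage, objs_per_zspage) for one class size."""
    best_pages, best_used = 1, 0
    for pages in range(1, max_pages + 1):
        zspage_size = pages * PAGE_SIZE
        used_pc = (zspage_size - zspage_size % size) * 100 // zspage_size
        if used_pc > best_used:
            best_pages, best_used = pages, used_pc
    return best_pages, best_pages * PAGE_SIZE // size


def merged_into(max_pages):
    """Map each class index to the index of the class that backs it."""
    backing = {}
    prev_index, prev_shape = None, None
    # Walk from the largest class down, as merging targets the larger class.
    for index in range(NUM_CLASSES - 1, -1, -1):
        shape = zspage_shape(class_size(index), max_pages)
        if shape == prev_shape:
            backing[index] = prev_index      # same shape: reuse larger class
        else:
            backing[index] = index           # new, distinct class
            prev_index, prev_shape = index, shape
    return backing


print(merged_into(4)[96], merged_into(8)[96])
```

With at most 4 pages per zspage, class #96 is backed by class #100; with at
most 8 pages it keeps its own (5 pages, 13 objects) layout and stands alone.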

Of course the described effect does not apply only to size classes #96 and
#100. We still merge classes, but less often so. In other words classes are
grouped in a more compact way, which decreases memory wastage:

zspage order               # unique size classes
     2                                69
     3                               123
     4                               191

Let's take a closer look at the bottom of /sys/kernel/debug/zsmalloc/zram0/classes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  254  4096           0            0             0          0          0                1        0
...

For exactly the same reason - maximum 4 pages per zspage - the last non-huge
size class is #202, which stores objects of size 3264 bytes. Any object
larger than 3264 bytes, hence, is considered to be huge and lands in size
class #254, which uses a whole physical page to store every object. To put
it slightly differently - objects in huge classes don't share physical pages.

3264 bytes is too low a watermark and we have too many huge classes:
classes from #203 to #254. Similarly to size class #96 above, higher order
zspages change key characteristics for some of those huge size classes and
thus those classes become normal classes, where stored objects share physical
pages.

Hence yet another consequence of higher order zspages: we move the huge
size class watermark up, have fewer huge classes and store large objects
in a more compact way.
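
The watermark can be estimated with the same kind of model. The sketch below
assumes 4 KiB pages and class sizes of 32 + 16 * index bytes, and treats a
class as huge when its best zspage layout holds a single object. The names
are illustrative, and the kernel derives its own `huge_class_size` value
differently in detail.

```python
PAGE_SIZE = 4096          # assumption: 4 KiB pages
MIN_ALLOC_SIZE = 32       # assumption: size of class #0
CLASS_DELTA = 16          # assumption: size step between classes
NUM_CLASSES = 255


def objs_per_zspage(size, max_pages):
    """Objects held by the best zspage layout for this class size."""
    best_pages, best_used = 1, 0
    for pages in range(1, max_pages + 1):
        zspage_size = pages * PAGE_SIZE
        used_pc = (zspage_size - zspage_size % size) * 100 // zspage_size
        if used_pc > best_used:
            best_pages, best_used = pages, used_pc
    return best_pages * PAGE_SIZE // size


def huge_class_watermark(max_pages):
    """Largest class size whose zspage still stores more than one object."""
    watermark = 0
    for index in range(NUM_CLASSES):
        size = min(MIN_ALLOC_SIZE + index * CLASS_DELTA, PAGE_SIZE)
        if objs_per_zspage(size, max_pages) > 1:
            watermark = max(watermark, size)
    return watermark


# Order 2, 3 and 4 zspages allow up to 4, 8 and 16 pages respectively.
for order in (2, 3, 4):
    print(order, huge_class_watermark(1 << order))
```

Intuitively, a class of size s stops being huge once some k-page zspage fits
k + 1 objects, that is when s <= 4096 * k / (k + 1); a larger page limit
raises that bound.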

For order 3, huge class watermark becomes 3632 bytes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  211  3408           0            0             0          0          0                5        0
  217  3504           0            0             0          0          0                6        0
  222  3584           0            0             0          0          0                7        0
  225  3632           0            0             0          0          0                8        0
  254  4096           0            0             0          0          0                1        0
...

For order 4, huge class watermark becomes 3840 bytes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  206  3328           0            0             0          0          0               13        0
  207  3344           0            0             0          0          0                9        0
  208  3360           0            0             0          0          0               14        0
  211  3408           0            0             0          0          0                5        0
  212  3424           0            0             0          0          0               16        0
  214  3456           0            0             0          0          0               11        0
  217  3504           0            0             0          0          0                6        0
  219  3536           0            0             0          0          0               13        0
  222  3584           0            0             0          0          0                7        0
  223  3600           0            0             0          0          0               15        0
  225  3632           0            0             0          0          0                8        0
  228  3680           0            0             0          0          0                9        0
  230  3712           0            0             0          0          0               10        0
  232  3744           0            0             0          0          0               11        0
  234  3776           0            0             0          0          0               12        0
  235  3792           0            0             0          0          0               13        0
  236  3808           0            0             0          0          0               14        0
  238  3840           0            0             0          0          0               15        0
  254  4096           0            0             0          0          0                1        0
...

TESTS
=====

1) ChromeOS memory pressure test
=============================================================================

Our standard memory pressure test, which is designed with reproducibility
in mind.

zram is configured as a swap device with the lzo-rle compression algorithm.
We captured /sys/block/zram0/mm_stat after every test and rebooted the
device.

Columns, per Documentation/admin-guide/blockdev/zram.rst:

orig_data_size        mem_used_total      mem_used_max         pages_compacted
          compr_data_size         mem_limit           same_pages          huge_pages

ORDER 2 (BASE) zspage

10353639424 2981711944 3166896128        0 3543158784   579494   825135   123707
10168573952 2932288347 3106541568        0 3499085824   565187   853137   126153
9950461952 2815911234 3035693056        0 3441090560   586696   748054   122103
9892335616 2779566152 2943459328        0 3514736640   591541   650696   119621
9993949184 2814279212 3021357056        0 3336421376   582488   711744   121273
9953226752 2856382009 3025649664        0 3512893440   564559   787861   123034
9838448640 2785481728 2997575680        0 3367219200   573282   777099   122739

ORDER 3 zspage

9509138432 2706941227 2823393280        0 3389587456   535856  1011472    90223
10105245696 2882368370 3013095424        0 3296165888   563896  1059033    94808
9531236352 2666125512 2867650560        0 3396173824   567117  1126396    88807
9561812992 2714536764 2956652544        0 3310505984   548223   827322    90992
9807470592 2790315707 2908053504        0 3378315264   563670  1020933    93725
10178371584 2948838782 3071209472        0 3329548288   548533   954546    90730
9925165056 2849839413 2958274560        0 3336978432   551464  1058302    89381

ORDER 4 zspage

9444515840 2613362645 2668232704        0 3396759552   573735  1162207    83475
10129108992 2925888488 3038351360        0 3499597824   555634  1231542    84525
9876594688 2786692282 2897006592        0 3469463552   584835  1290535    84133
10012909568 2649711847 2801512448        0 3171323904   675405   750728    80424
10120966144 2866742402 2978639872        0 3257815040   587435  1093981    83587
9578790912 2671245225 2802270208        0 3376353280   545548  1047930    80895
10108588032 2888433523 2983960576        0 3316641792   571445  1290640    81402

First, we establish that order 3 and 4 don't cause any statistically
significant change in `orig_data_size` (number of bytes we store during
the test); in other words, larger zspages don't cause regressions.

T-test for order 3:

x order-2-stored
+ order-3-stored
+-----------------------------------------------------------------------------+
|+ +  +                     +  x   x  +  x   x         +    x+               x|
| |________________________AM__|_________M_____A____|__________|              |
+-----------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   7 9.8384486e+09 1.0353639e+10 9.9532268e+09 1.0021519e+10 1.7916718e+08
+   7 9.5091384e+09 1.0178372e+10 9.8074706e+09 9.8026344e+09 2.7856206e+08
No difference proven at 95.0% confidence

T-test for order 4:

x order-2-stored
+ order-4-stored
+-----------------------------------------------------------------------------+
|                                                         +                   |
|+          +                     x  +x    xx  x +       ++   x              x|
|              |__________________|____A____M____M____________|_|             |
+-----------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   7 9.8384486e+09 1.0353639e+10 9.9532268e+09 1.0021519e+10 1.7916718e+08
+   7 9.4445158e+09 1.0129109e+10  1.001291e+10 9.8959249e+09 2.7947784e+08
No difference proven at 95.0% confidence

Next we establish that there is a statistically significant improvement
in `mem_used_total` metrics.

T-test for order 3:

x order-2-usedmem
+ order-3-usedmem
+-----------------------------------------------------------------------------+
|+         +        +       x ++        x  + xx x       +       x            x|
|        |_________________A__M__|____________|__A________________|           |
+-----------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   7 2.9434593e+09 3.1668961e+09 3.0256497e+09 3.0424532e+09      73235062
+   7 2.8233933e+09 3.0712095e+09 2.9566525e+09 2.9426185e+09      84630851
Difference at 95.0% confidence
	-9.98347e+07 +/- 9.21744e+07
	-3.28139% +/- 3.02961%
	(Student's t, pooled s = 7.91383e+07)

T-test for order 4:

x order-2-usedmem
+ order-4-usedmem
+-----------------------------------------------------------------------------+
|                    +                                 x                      |
|+                   +              +      x    ++ x   x *          x        x|
|             |__________________A__M__________|_____|_M__A__________|        |
+-----------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   7 2.9434593e+09 3.1668961e+09 3.0256497e+09 3.0424532e+09      73235062
+   7 2.6682327e+09 3.0383514e+09 2.8970066e+09 2.8814248e+09 1.3098053e+08
Difference at 95.0% confidence
	-1.61028e+08 +/- 1.23591e+08
	-5.29272% +/- 4.0622%
	(Student's t, pooled s = 1.06111e+08)

Order 3 zspages also show statistically significant improvement in
`mem_used_max` metrics.

T-test for order 3:

x order-2-maxmem
+ order-3-maxmem
+-----------------------------------------------------------------------------+
|+   +     + x+        x  +   + +             x                x    x        x|
|    |________M__A_________|_|_____________________A___________M____________| |
+-----------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   7 3.3364214e+09 3.5431588e+09 3.4990858e+09 3.4592294e+09      80073158
+   7 3.2961659e+09 3.3961738e+09 3.3369784e+09 3.3481822e+09      39840377
Difference at 95.0% confidence
	-1.11047e+08 +/- 7.36589e+07
	-3.21017% +/- 2.12934%
	(Student's t, pooled s = 6.32415e+07)

Order 4 zspages, on the other hand, do not show any statistically significant
improvement in `mem_used_max` metrics.

T-test for order 4:

x order-2-maxmem
+ order-4-maxmem
+-----------------------------------------------------------------------------+
|+                 +           +   x     x +   +        x     +     *  x     x|
|              |_______________________A___M________________A_|_____M_______| |
+-----------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   7 3.3364214e+09 3.5431588e+09 3.4990858e+09 3.4592294e+09      80073158
+   7 3.1713239e+09 3.4995978e+09 3.3763533e+09 3.3554221e+09 1.1609062e+08
No difference proven at 95.0% confidence

Overall, with a sufficient level of confidence, order 3 zspages appear to
be beneficial for this particular use case and these data patterns.

As expected, we also observed fewer huge pages (`huge_pages`) when zsmalloc
is configured with order 3 and order 4 zspages, for the reason already
explained.

2) Synthetic test
=============================================================================

The test untars linux-6.0.tar.xz and compiles the kernel.

zram is configured as a block device with ext4 file system, lzo-rle
compression algorithm. We captured /sys/block/zram0/mm_stat after
every test and rebooted the VM.

orig_data_size       mem_used_total     mem_used_max       pages_compacted
          compr_data_size         mem_limit         same_pages       huge_pages

ORDER 2 (BASE) zspage

1691791360 628086729 655171584        0 655171584       60        0    34043
1691787264 628089196 655175680        0 655175680       60        0    34046
1691803648 628098840 655187968        0 655187968       59        0    34047
1691795456 628091503 655183872        0 655183872       60        0    34044
1691799552 628086877 655183872        0 655183872       60        0    34047

ORDER 3 zspage

1691803648 627792993 641794048        0 641794048       60        0    33591
1691787264 627779342 641708032        0 641708032       59        0    33591
1691811840 627786616 641769472        0 641769472       60        0    33591
1691803648 627794468 641818624        0 641818624       59        0    33592
1691783168 627780882 641794048        0 641794048       61        0    33591

ORDER 4 zspage

1691803648 627726635 639655936        0 639655936       60        0    33435
1691811840 627733348 639643648        0 639643648       61        0    33434
1691795456 627726290 639614976        0 639614976       60        0    33435
1691803648 627730458 639688704        0 639688704       60        0    33434
1691811840 627727771 639688704        0 639688704       60        0    33434

Order 3 and order 4 show statistically significant improvement in
`mem_used_max` metrics.

T-test for order 3:

x order-2-maxmem
+ order-3-maxmem
+--------------------------------------------------------------------------+
|+                                                                        x|
|+                                                                        x|
|+                                                                        x|
|++                                                                       x|
|A|                                                                       A|
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5 6.5517158e+08 6.5518797e+08 6.5518387e+08  6.551806e+08     6730.4157
+   5 6.4170803e+08 6.4181862e+08 6.4179405e+08 6.4177684e+08     42210.666
Difference at 95.0% confidence
	-1.34038e+07 +/- 44080.7
	-2.04581% +/- 0.00672802%
	(Student's t, pooled s = 30224.5)

T-test for order 4:

x order-2-maxmem
+ order-4-maxmem
+--------------------------------------------------------------------------+
|+                                                                        x|
|+                                                                        x|
|+                                                                        x|
|+                                                                        x|
|+                                                                        x|
|A                                                                        A|
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5 6.5517158e+08 6.5518797e+08 6.5518387e+08  6.551806e+08     6730.4157
+   5 6.3961498e+08  6.396887e+08 6.3965594e+08 6.3965839e+08     31408.602
Difference at 95.0% confidence
	-1.55222e+07 +/- 33126.2
	-2.36915% +/- 0.00505604%
	(Student's t, pooled s = 22713.4)

This test tends to benefit more from order 4 zspages, due to the test's
data patterns.

zsmalloc object distribution analysis
=============================================================================

Order 2 (4 pages per zspage) tends to put many objects in size class 2048,
which is merged with size classes #112-#125:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
    71  1168           0            0          6146       6146       1756                2        0
    74  1216           0            1          4560       4552       1368                3        0
    76  1248           0            1          2938       2934        904                4        0
    83  1360           0            0         10971      10971       3657                1        0
    91  1488           0            0         16126      16126       5864                4        0
    94  1536           0            1          5912       5908       2217                3        0
   100  1632           0            0         11990      11990       4796                2        0
   107  1744           0            1         15771      15768       6759                3        0
   111  1808           0            1         10386      10380       4616                4        0
   126  2048           0            0         45444      45444      22722                1        0
   144  2336           0            0         47446      47446      27112                4        0
   151  2448           1            0         10760      10759       6456                3        0
   168  2720           0            0         10173      10173       6782                2        0
   190  3072           0            1          1700       1697       1275                3        0
   202  3264           0            1           290        286        232                4        0
   254  4096           0            0         34051      34051      34051                1        0

Order 3 (8 pages per zspage) changed pool characteristics and unmerged
some of the size classes, which resulted in fewer objects being put into
size class 2048, because lower size classes are now available
for more compact object storage:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
    71  1168           0            1          2996       2994        856                2        0
    72  1184           0            1          1632       1609        476                7        0
    73  1200           1            0          1445       1442        425                5        0
    74  1216           0            0          1510       1510        453                3        0
    75  1232           0            1          1495       1479        455                7        0
    76  1248           0            1          1456       1451        448                4        0
    78  1280           0            1          3040       3033        950                5        0
    79  1296           0            1          1584       1571        504                7        0
    83  1360           0            0          6375       6375       2125                1        0
    84  1376           0            1          1817       1796        632                8        0
    87  1424           0            1          6020       6006       2107                7        0
    88  1440           0            1          2108       2101        744                6        0
    89  1456           0            1          2072       2064        740                5        0
    91  1488           0            1          4169       4159       1516                4        0
    92  1504           0            1          2014       2007        742                7        0
    94  1536           0            1          3904       3900       1464                3        0
    95  1552           0            1          1890       1873        720                8        0
    96  1568           0            1          1963       1958        755                5        0
    97  1584           0            1          1980       1974        770                7        0
   100  1632           0            1          6190       6187       2476                2        0
   103  1680           0            0          6477       6477       2667                7        0
   104  1696           0            1          2256       2253        940                5        0
   105  1712           0            1          2356       2340        992                8        0
   107  1744           1            0          4697       4696       2013                3        0
   110  1792           0            1          7744       7734       3388                7        0
   111  1808           0            1          2655       2649       1180                4        0
   114  1856           0            1          8371       8365       3805                5        0
   116  1888           1            0          5863       5862       2706                6        0
   117  1904           0            1          2955       2942       1379                7        0
   118  1920           0            1          3009       2997       1416                8        0
   126  2048           0            0         25276      25276      12638                1        0
   128  2080           0            1          6060       6052       3232                8        0
   129  2096           1            0          3081       3080       1659                7        0
   134  2176           0            1         14835      14830       7912                8        0
   135  2192           0            1          2769       2758       1491                7        0
   137  2224           0            1          5082       5077       2772                6        0
   140  2272           0            1          7236       7232       4020                5        0
   144  2336           0            1          8428       8423       4816                4        0
   147  2384           0            1          5316       5313       3101                7        0
   151  2448           0            1          5445       5443       3267                3        0
   155  2512           0            0          4121       4121       2536                8        0
   158  2560           0            1          2208       2205       1380                5        0
   160  2592           0            0          1133       1133        721                7        0
   168  2720           0            0          2712       2712       1808                2        0
   177  2864           1            0          1100       1098        770                7        0
   180  2912           0            1           189        183        135                5        0
   184  2976           0            1           176        166        128                8        0
   190  3072           0            0           252        252        189                3        0
   197  3184           0            1           198        192        154                7        0
   202  3264           0            1           100         96         80                4        0
   211  3408           0            1           210        208        175                5        0
   217  3504           0            1            98         94         84                6        0
   222  3584           0            0           104        104         91                7        0
   225  3632           0            1            54         50         48                8        0
   254  4096           0            0         33591      33591      33591                1        0

Note, the huge size watermark is above 3632 and there are a number of new
normal classes available that previously were merged with the huge class.
For instance, size class #211 holds 210 objects of size 3408 and uses 175
physical pages, while previously for those objects we would have used 210
physical pages.

Signed-off-by: Sergey Senozhatsky <[email protected]>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Oct 27, 2022
zsmalloc has 255 size classes. Size classes contain a number of zspages,
which store objects of the same size. zspage can consist of up to four
physical pages. The exact (most optimal) zspage size is calculated for
each size class during zsmalloc pool creation.

As a reasonable optimization, zsmalloc merges size classes that have
similar characteristics: number of pages per zspage and number of
objects zspage can store.

For example, let's look at the following size classes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
..
   94  1536           0            0             0          0          0                3        0
  100  1632           0            0             0          0          0                2        0
..

Size classes #95-99 are merged with size class #100. That is, each time
we store an object of size, say, 1568 bytes, instead of using class #96
we end up storing it in size class #100. Class #100 is for objects of
1632 bytes in size, hence every 1568-byte object wastes 1632-1568 = 64
bytes. Class #100 zspages consist of 2 physical pages and can hold 5
objects. When we need to store, say, 13 objects of size 1568 we end up
allocating three zspages; in other words, 6 physical pages.

However, if we look closer at size class #96 (which should hold objects
of size 1568 bytes) and trace get_pages_per_zspage():

    pages per zspage      wasted bytes     used%
           1                  960           76
           2                  352           95
           3                 1312           89
           4                  704           95
           5                   96           99

We'd notice that the most optimal zspage configuration for this class is
when it consists of 5 physical pages, but currently we never let zspages
consist of more than 4 pages. A 5 page class #96 configuration would
store 13 objects of size 1568 in a single zspage, allocating 5 physical
pages, as opposed to 6 physical pages that class #100 will allocate.
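
The table and the 5-versus-6 page comparison above can be reproduced with a small sketch. It is a simplified model of the selection logic described in this message, not the kernel code itself; the 4 KiB page size and the helper names (`zspage_usage`, `best_pages_per_zspage`, `pages_needed`) are assumptions made for illustration.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, as in the examples above


def zspage_usage(class_size, max_pages):
    """Return (pages, wasted_bytes, used_percent) for every candidate zspage size."""
    rows = []
    for pages in range(1, max_pages + 1):
        zspage_size = pages * PAGE_SIZE
        waste = zspage_size % class_size
        used_pc = (zspage_size - waste) * 100 // zspage_size  # integer percent
        rows.append((pages, waste, used_pc))
    return rows


def best_pages_per_zspage(class_size, max_pages):
    """Pick the smallest page count that reaches the highest used percentage."""
    best_pages, best_pc = 1, -1
    for pages, _waste, used_pc in zspage_usage(class_size, max_pages):
        if used_pc > best_pc:  # strict comparison keeps the smaller zspage on ties
            best_pages, best_pc = pages, used_pc
    return best_pages


def pages_needed(n_objects, class_size, pages_per_zspage):
    """Physical pages allocated to store n_objects in zspages of the given size."""
    objs_per_zspage = pages_per_zspage * PAGE_SIZE // class_size
    zspages = -(-n_objects // objs_per_zspage)  # ceiling division
    return zspages * pages_per_zspage


for row in zspage_usage(1568, 5):
    print(row)
print(pages_needed(13, 1632, 2), pages_needed(13, 1568, 5))
```

With a limit of 4 pages the model picks 2 pages for a 1568-byte class; once 5 pages are allowed it picks 5, which is the effect the paragraph above describes.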

A higher order zspage for class #96 also changes its key characteristics:
pages per zspage and objects per zspage. As a result classes #96 and #100
are not merged anymore, which gives us a more compact zsmalloc.

Of course the described effect does not apply only to size classes #96 and
#100. We still merge classes, but less often so. In other words classes are
grouped in a more compact way, which decreases memory wastage:

zspage order               # unique size classes
     2                                69
     3                               123
     4                               191
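
A rough model of the merging rule can show why a larger zspage limit leaves more unique classes. The sketch below assumes 4 KiB pages, a 32-byte minimum class size and a 16-byte class step (these match the class/size pairs in the tables of this message), and it merges a class into the next larger one only when pages per zspage and objects per zspage are both equal. The function names are hypothetical, and the kernel implementation may differ in details.

```python
PAGE_SIZE = 4096          # assumption: 4 KiB pages
ZS_MIN_ALLOC_SIZE = 32    # assumption: matches class 96 -> 1568 bytes in the tables
ZS_SIZE_CLASS_DELTA = 16  # assumption: step between neighbouring classes
ZS_SIZE_CLASSES = 255


def class_size(index):
    return min(ZS_MIN_ALLOC_SIZE + index * ZS_SIZE_CLASS_DELTA, PAGE_SIZE)


def characteristics(size, max_pages):
    """Return (pages_per_zspage, objs_per_zspage) for one class size."""
    best_pages, best_pc = 1, -1
    for pages in range(1, max_pages + 1):
        zspage_size = pages * PAGE_SIZE
        used_pc = (zspage_size - zspage_size % size) * 100 // zspage_size
        if used_pc > best_pc:
            best_pages, best_pc = pages, used_pc
    return best_pages, best_pages * PAGE_SIZE // size


def merge_classes(max_pages):
    """Map every class index to the index of the class that really stores it."""
    owner = {}
    prev_index, prev_chars = None, None
    # Walk from the largest class down, merging into the previous (larger) class
    # whenever both characteristics are identical.
    for index in range(ZS_SIZE_CLASSES - 1, -1, -1):
        chars = characteristics(class_size(index), max_pages)
        if chars == prev_chars:
            owner[index] = prev_index
        else:
            owner[index] = index
            prev_index, prev_chars = index, chars
    return owner


for order in (2, 3, 4):
    owner = merge_classes(1 << order)
    print(order, len(set(owner.values())), owner[96])
```

If the model matches the kernel's behaviour, the printed counts should be comparable with the table above; that has not been verified here. What can be checked by hand is that class #96 maps to #100 with a 4-page limit and to itself with an 8-page limit.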

Let's take a closer look at the bottom of /sys/kernel/debug/zsmalloc/zram0/classes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  254  4096           0            0             0          0          0                1        0
...

For exactly the same reason - maximum 4 pages per zspage - the last non-huge
size class is #202, which stores objects of size 3264 bytes. Any object
larger than 3264 bytes, hence, is considered to be huge and lands in size
class #254, which uses a whole physical page to store every object. To put
it slightly differently - objects in huge classes don't share physical pages.

3264 bytes is too low of a watermark and we have too many huge classes:
classes from #203 to #254. Similarly to size class #96 above, higher order
zspages change key characteristics for some of those huge size classes and
thus those classes become normal classes, where stored objects share physical
pages.

Hence yet another consequence of higher order zspages: we raise the huge
size class watermark, have fewer huge classes and store large objects in a
more compact way.
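
How the watermark moves with the zspage limit can be sketched as follows. This is a simplified model under stated assumptions (4 KiB pages, class sizes from 32 to 4096 bytes in 16-byte steps, and a class counted as huge when its zspage is a single page holding a single object); `huge_class_watermark` is a hypothetical helper, not a kernel function.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def best_pages(size, max_pages):
    """Smallest zspage page count with the highest integer used percentage."""
    best, best_pc = 1, -1
    for pages in range(1, max_pages + 1):
        zspage_size = pages * PAGE_SIZE
        used_pc = (zspage_size - zspage_size % size) * 100 // zspage_size
        if used_pc > best_pc:
            best, best_pc = pages, used_pc
    return best


def huge_class_watermark(max_pages):
    """Largest class size whose objects still share physical pages."""
    watermark = 0
    for size in range(32, PAGE_SIZE + 1, 16):
        pages = best_pages(size, max_pages)
        objs = pages * PAGE_SIZE // size
        is_huge = pages == 1 and objs == 1
        if not is_huge:
            watermark = size
    return watermark


for order in (2, 3, 4):
    print(order, huge_class_watermark(1 << order))
```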

For order 3, huge class watermark becomes 3632 bytes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  211  3408           0            0             0          0          0                5        0
  217  3504           0            0             0          0          0                6        0
  222  3584           0            0             0          0          0                7        0
  225  3632           0            0             0          0          0                8        0
  254  4096           0            0             0          0          0                1        0
...

For order 4, huge class watermark becomes 3840 bytes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  206  3328           0            0             0          0          0               13        0
  207  3344           0            0             0          0          0                9        0
  208  3360           0            0             0          0          0               14        0
  211  3408           0            0             0          0          0                5        0
  212  3424           0            0             0          0          0               16        0
  214  3456           0            0             0          0          0               11        0
  217  3504           0            0             0          0          0                6        0
  219  3536           0            0             0          0          0               13        0
  222  3584           0            0             0          0          0                7        0
  223  3600           0            0             0          0          0               15        0
  225  3632           0            0             0          0          0                8        0
  228  3680           0            0             0          0          0                9        0
  230  3712           0            0             0          0          0               10        0
  232  3744           0            0             0          0          0               11        0
  234  3776           0            0             0          0          0               12        0
  235  3792           0            0             0          0          0               13        0
  236  3808           0            0             0          0          0               14        0
  238  3840           0            0             0          0          0               15        0
  254  4096           0            0             0          0          0                1        0
...

TESTS
=====

Test untars linux-6.0.tar.xz and compiles the kernel.

zram is configured as a block device with ext4 file system, lzo-rle
compression algorithm. We captured /sys/block/zram0/mm_stat after
every test and rebooted the VM.

orig_data_size       mem_used_total     mem_used_max       pages_compacted
          compr_data_size         mem_limit         same_pages       huge_pages

ORDER 2 (BASE) zspage

1691791360 628086729 655171584        0 655171584       60        0    34043
1691787264 628089196 655175680        0 655175680       60        0    34046
1691803648 628098840 655187968        0 655187968       59        0    34047
1691795456 628091503 655183872        0 655183872       60        0    34044
1691799552 628086877 655183872        0 655183872       60        0    34047

ORDER 3 zspage

1691803648 627792993 641794048        0 641794048       60        0    33591
1691787264 627779342 641708032        0 641708032       59        0    33591
1691811840 627786616 641769472        0 641769472       60        0    33591
1691803648 627794468 641818624        0 641818624       59        0    33592
1691783168 627780882 641794048        0 641794048       61        0    33591

ORDER 4 zspage

1691803648 627726635 639655936        0 639655936       60        0    33435
1691811840 627733348 639643648        0 639643648       61        0    33434
1691795456 627726290 639614976        0 639614976       60        0    33435
1691803648 627730458 639688704        0 639688704       60        0    33434
1691811840 627727771 639688704        0 639688704       60        0    33434
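
For reading the rows above, a minimal sketch that maps one mm_stat line onto the column names from the two-line header may help. The field order is taken from that header; `parse_mm_stat` is a hypothetical helper, and the sample line is the first ORDER 2 row.

```python
MM_STAT_FIELDS = (
    "orig_data_size", "compr_data_size", "mem_used_total", "mem_limit",
    "mem_used_max", "same_pages", "pages_compacted", "huge_pages",
)


def parse_mm_stat(line):
    """Turn one whitespace-separated mm_stat line into a dict of integers."""
    values = [int(v) for v in line.split()]
    # zip() stops at the shorter sequence, so extra trailing columns are ignored.
    return dict(zip(MM_STAT_FIELDS, values))


sample = "1691791360 628086729 655171584        0 655171584       60        0    34043"
stat = parse_mm_stat(sample)
print(stat["mem_used_max"], stat["huge_pages"])
# Overhead of the pool relative to the compressed payload, in bytes.
print(stat["mem_used_max"] - stat["compr_data_size"])
```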

Order 3 and order 4 show statistically significant improvement in
`mem_used_max` metrics.

T-test for order 3:

x order-2-maxmem
+ order-3-maxmem
    N           Min           Max        Median           Avg        Stddev
x   5 6.5517158e+08 6.5518797e+08 6.5518387e+08  6.551806e+08     6730.4157
+   5 6.4170803e+08 6.4181862e+08 6.4179405e+08 6.4177684e+08     42210.666
Difference at 95.0% confidence
	-1.34038e+07 +/- 44080.7
	-2.04581% +/- 0.00672802%
	(Student's t, pooled s = 30224.5)

T-test for order 4:

x order-2-maxmem
+ order-4-maxmem
    N           Min           Max        Median           Avg        Stddev
x   5 6.5517158e+08 6.5518797e+08 6.5518387e+08  6.551806e+08     6730.4157
+   5 6.3961498e+08  6.396887e+08 6.3965594e+08 6.3965839e+08     31408.602
Difference at 95.0% confidence
	-1.55222e+07 +/- 33126.2
	-2.36915% +/- 0.00505604%
	(Student's t, pooled s = 22713.4)
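
The reported differences can be cross-checked from the `mem_used_max` column with a pooled-variance Student's t interval. This is a sketch rather than the tool that produced the output above; the critical value 2.306 (two-sided 95%, 8 degrees of freedom) is hard-coded as an assumption and is only valid for two samples of five values each.

```python
from math import sqrt
from statistics import mean, variance

T_CRIT_95_DF8 = 2.306  # assumption: two-sided 95% t quantile for 8 degrees of freedom

ORDER2 = [655171584, 655175680, 655187968, 655183872, 655183872]
ORDER3 = [641794048, 641708032, 641769472, 641818624, 641794048]
ORDER4 = [639655936, 639643648, 639614976, 639688704, 639688704]


def compare(base, new, t_crit=T_CRIT_95_DF8):
    """Return (difference of means, CI half-width, pooled stddev)."""
    n1, n2 = len(base), len(new)
    pooled_var = ((n1 - 1) * variance(base) + (n2 - 1) * variance(new)) / (n1 + n2 - 2)
    pooled_s = sqrt(pooled_var)
    half_width = t_crit * pooled_s * sqrt(1 / n1 + 1 / n2)
    return mean(new) - mean(base), half_width, pooled_s


for name, sample in (("order 3", ORDER3), ("order 4", ORDER4)):
    diff, half, pooled = compare(ORDER2, sample)
    print(name, diff, half, pooled, 100 * diff / mean(ORDER2))
```

An interval that does not include zero is what "statistically significant" refers to in the summary above.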

This test tends to benefit more from order 4 zspages, due to the test's
data patterns.

zsmalloc object distribution analysis
=============================================================================

Order 2 (4 pages per zspage) tends to put many objects in size class 2048,
which is merged with size classes #112-#125:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
    71  1168           0            0          6146       6146       1756                2        0
    74  1216           0            1          4560       4552       1368                3        0
    76  1248           0            1          2938       2934        904                4        0
    83  1360           0            0         10971      10971       3657                1        0
    91  1488           0            0         16126      16126       5864                4        0
    94  1536           0            1          5912       5908       2217                3        0
   100  1632           0            0         11990      11990       4796                2        0
   107  1744           0            1         15771      15768       6759                3        0
   111  1808           0            1         10386      10380       4616                4        0
   126  2048           0            0         45444      45444      22722                1        0
   144  2336           0            0         47446      47446      27112                4        0
   151  2448           1            0         10760      10759       6456                3        0
   168  2720           0            0         10173      10173       6782                2        0
   190  3072           0            1          1700       1697       1275                3        0
   202  3264           0            1           290        286        232                4        0
   254  4096           0            0         34051      34051      34051                1        0

Order 3 (8 pages per zspage) changed pool characteristics and unmerged
some of the size classes, which resulted in fewer objects being put into
size class 2048, because lower size classes are now available
for more compact object storage:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
    71  1168           0            1          2996       2994        856                2        0
    72  1184           0            1          1632       1609        476                7        0
    73  1200           1            0          1445       1442        425                5        0
    74  1216           0            0          1510       1510        453                3        0
    75  1232           0            1          1495       1479        455                7        0
    76  1248           0            1          1456       1451        448                4        0
    78  1280           0            1          3040       3033        950                5        0
    79  1296           0            1          1584       1571        504                7        0
    83  1360           0            0          6375       6375       2125                1        0
    84  1376           0            1          1817       1796        632                8        0
    87  1424           0            1          6020       6006       2107                7        0
    88  1440           0            1          2108       2101        744                6        0
    89  1456           0            1          2072       2064        740                5        0
    91  1488           0            1          4169       4159       1516                4        0
    92  1504           0            1          2014       2007        742                7        0
    94  1536           0            1          3904       3900       1464                3        0
    95  1552           0            1          1890       1873        720                8        0
    96  1568           0            1          1963       1958        755                5        0
    97  1584           0            1          1980       1974        770                7        0
   100  1632           0            1          6190       6187       2476                2        0
   103  1680           0            0          6477       6477       2667                7        0
   104  1696           0            1          2256       2253        940                5        0
   105  1712           0            1          2356       2340        992                8        0
   107  1744           1            0          4697       4696       2013                3        0
   110  1792           0            1          7744       7734       3388                7        0
   111  1808           0            1          2655       2649       1180                4        0
   114  1856           0            1          8371       8365       3805                5        0
   116  1888           1            0          5863       5862       2706                6        0
   117  1904           0            1          2955       2942       1379                7        0
   118  1920           0            1          3009       2997       1416                8        0
   126  2048           0            0         25276      25276      12638                1        0
   128  2080           0            1          6060       6052       3232                8        0
   129  2096           1            0          3081       3080       1659                7        0
   134  2176           0            1         14835      14830       7912                8        0
   135  2192           0            1          2769       2758       1491                7        0
   137  2224           0            1          5082       5077       2772                6        0
   140  2272           0            1          7236       7232       4020                5        0
   144  2336           0            1          8428       8423       4816                4        0
   147  2384           0            1          5316       5313       3101                7        0
   151  2448           0            1          5445       5443       3267                3        0
   155  2512           0            0          4121       4121       2536                8        0
   158  2560           0            1          2208       2205       1380                5        0
   160  2592           0            0          1133       1133        721                7        0
   168  2720           0            0          2712       2712       1808                2        0
   177  2864           1            0          1100       1098        770                7        0
   180  2912           0            1           189        183        135                5        0
   184  2976           0            1           176        166        128                8        0
   190  3072           0            0           252        252        189                3        0
   197  3184           0            1           198        192        154                7        0
   202  3264           0            1           100         96         80                4        0
   211  3408           0            1           210        208        175                5        0
   217  3504           0            1            98         94         84                6        0
   222  3584           0            0           104        104         91                7        0
   225  3632           0            1            54         50         48                8        0
   254  4096           0            0         33591      33591      33591                1        0

Note, the huge size watermark is above 3632 and there are a number of new
normal classes available that previously were merged with the huge class.
For instance, size class #211 holds 210 objects of size 3408 and uses 175
physical pages, while previously for those objects we would have used 210
physical pages.
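
The page counts in the tables follow from the class size and the pages-per-zspage value, which a short sketch can make explicit. It assumes 4 KiB pages and that zspages are always allocated whole; `pages_used` is a hypothetical helper.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def pages_used(obj_allocated, size, pages_per_zspage):
    """Physical pages backing obj_allocated object slots of one size class."""
    objs_per_zspage = pages_per_zspage * PAGE_SIZE // size
    zspages = -(-obj_allocated // objs_per_zspage)  # ceiling division
    return zspages * pages_per_zspage


# Class 211 under order 3: 210 slots of 3408 bytes in 5-page zspages.
print(pages_used(210, 3408, 5))
# The same 210 objects in the huge class: one page per object.
print(pages_used(210, 4096, 1))
```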

Signed-off-by: Sergey Senozhatsky <[email protected]>
jonhunter pushed a commit to jonhunter/linux that referenced this pull request Oct 28, 2022
zsmalloc has 255 size classes.  Size classes contain a number of zspages,
which store objects of the same size.  zspage can consist of up to four
physical pages.  The exact (most optimal) zspage size is calculated for
each size class during zsmalloc pool creation.

As a reasonable optimization, zsmalloc merges size classes that have
similar characteristics: number of pages per zspage and number of objects
zspage can store.

For example, let's look at the following size classes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
..
   94  1536           0            0             0          0          0                3        0
  100  1632           0            0             0          0          0                2        0
..

Size classes #95-99 are merged with size class #100.  That is, each time
we store an object of size, say, 1568 bytes, instead of using class #96 we
end up storing it in size class #100.  Class #100 is for objects of 1632
bytes in size, hence every 1568-byte object wastes 1632-1568 = 64 bytes.
Class #100 zspages consist of 2 physical pages and can hold 5 objects.
When we need to store, say, 13 objects of size 1568 we end up allocating
three zspages; in other words, 6 physical pages.

However, if we look closer at size class #96 (which should hold objects
of size 1568 bytes) and trace get_pages_per_zspage():

    pages per zspage      wasted bytes     used%
           1                  960           76
           2                  352           95
           3                 1312           89
           4                  704           95
           5                   96           99

We'd notice that the most optimal zspage configuration for this class is
when it consists of 5 physical pages, but currently we never let zspages
consist of more than 4 pages. A 5 page class #96 configuration would
store 13 objects of size 1568 in a single zspage, allocating 5 physical
pages, as opposed to 6 physical pages that class #100 will allocate.

A higher order zspage for class #96 also changes its key characteristics:
pages per zspage and objects per zspage. As a result classes #96 and #100
are not merged anymore, which gives us a more compact zsmalloc.

Of course the described effect does not apply only to size classes #96 and
#100. We still merge classes, but less often so. In other words classes are
grouped in a more compact way, which decreases memory wastage:

zspage order               # unique size classes
     2                                69
     3                               123
     4                               191

Let's take a closer look at the bottom of /sys/kernel/debug/zsmalloc/zram0/classes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  254  4096           0            0             0          0          0                1        0
...

For exactly the same reason - maximum 4 pages per zspage - the last non-huge
size class is #202, which stores objects of size 3264 bytes. Any object
larger than 3264 bytes, hence, is considered to be huge and lands in size
class #254, which uses a whole physical page to store every object. To put
it slightly differently - objects in huge classes don't share physical pages.

3264 bytes is too low of a watermark and we have too many huge classes:
classes from #203 to #254. Similarly to size class #96 above, higher order
zspages change key characteristics for some of those huge size classes and
thus those classes become normal classes, where stored objects share physical
pages.

Hence yet another consequence of higher order zspages: we raise the huge
size class watermark, have fewer huge classes and store large objects in a
more compact way.

For order 3, huge class watermark becomes 3632 bytes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  211  3408           0            0             0          0          0                5        0
  217  3504           0            0             0          0          0                6        0
  222  3584           0            0             0          0          0                7        0
  225  3632           0            0             0          0          0                8        0
  254  4096           0            0             0          0          0                1        0
...

For order 4, huge class watermark becomes 3840 bytes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  206  3328           0            0             0          0          0               13        0
  207  3344           0            0             0          0          0                9        0
  208  3360           0            0             0          0          0               14        0
  211  3408           0            0             0          0          0                5        0
  212  3424           0            0             0          0          0               16        0
  214  3456           0            0             0          0          0               11        0
  217  3504           0            0             0          0          0                6        0
  219  3536           0            0             0          0          0               13        0
  222  3584           0            0             0          0          0                7        0
  223  3600           0            0             0          0          0               15        0
  225  3632           0            0             0          0          0                8        0
  228  3680           0            0             0          0          0                9        0
  230  3712           0            0             0          0          0               10        0
  232  3744           0            0             0          0          0               11        0
  234  3776           0            0             0          0          0               12        0
  235  3792           0            0             0          0          0               13        0
  236  3808           0            0             0          0          0               14        0
  238  3840           0            0             0          0          0               15        0
  254  4096           0            0             0          0          0                1        0
...

TESTS
=====

Test untars linux-6.0.tar.xz and compiles the kernel.

zram is configured as a block device with ext4 file system, lzo-rle
compression algorithm. We captured /sys/block/zram0/mm_stat after
every test and rebooted the VM.

orig_data_size       mem_used_total     mem_used_max       pages_compacted
          compr_data_size         mem_limit         same_pages       huge_pages

ORDER 2 (BASE) zspage

1691791360 628086729 655171584        0 655171584       60        0    34043
1691787264 628089196 655175680        0 655175680       60        0    34046
1691803648 628098840 655187968        0 655187968       59        0    34047
1691795456 628091503 655183872        0 655183872       60        0    34044
1691799552 628086877 655183872        0 655183872       60        0    34047

ORDER 3 zspage

1691803648 627792993 641794048        0 641794048       60        0    33591
1691787264 627779342 641708032        0 641708032       59        0    33591
1691811840 627786616 641769472        0 641769472       60        0    33591
1691803648 627794468 641818624        0 641818624       59        0    33592
1691783168 627780882 641794048        0 641794048       61        0    33591

ORDER 4 zspage

1691803648 627726635 639655936        0 639655936       60        0    33435
1691811840 627733348 639643648        0 639643648       61        0    33434
1691795456 627726290 639614976        0 639614976       60        0    33435
1691803648 627730458 639688704        0 639688704       60        0    33434
1691811840 627727771 639688704        0 639688704       60        0    33434

Order 3 and order 4 show statistically significant improvement in
`mem_used_max` metrics.

T-test for order 3:

x order-2-maxmem
+ order-3-maxmem
    N           Min           Max        Median           Avg        Stddev
x   5 6.5517158e+08 6.5518797e+08 6.5518387e+08  6.551806e+08     6730.4157
+   5 6.4170803e+08 6.4181862e+08 6.4179405e+08 6.4177684e+08     42210.666
Difference at 95.0% confidence
	-1.34038e+07 +/- 44080.7
	-2.04581% +/- 0.00672802%
	(Student's t, pooled s = 30224.5)

T-test for order 4:

x order-2-maxmem
+ order-4-maxmem
    N           Min           Max        Median           Avg        Stddev
x   5 6.5517158e+08 6.5518797e+08 6.5518387e+08  6.551806e+08     6730.4157
+   5 6.3961498e+08  6.396887e+08 6.3965594e+08 6.3965839e+08     31408.602
Difference at 95.0% confidence
	-1.55222e+07 +/- 33126.2
	-2.36915% +/- 0.00505604%
	(Student's t, pooled s = 22713.4)

This test tends to benefit more from order 4 zspages, due to the test's
data patterns.

zsmalloc object distribution analysis
=============================================================================

Order 2 (4 pages per zspage) tends to put many objects in size class 2048,
which is merged with size classes #112-#125:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
    71  1168           0            0          6146       6146       1756                2        0
    74  1216           0            1          4560       4552       1368                3        0
    76  1248           0            1          2938       2934        904                4        0
    83  1360           0            0         10971      10971       3657                1        0
    91  1488           0            0         16126      16126       5864                4        0
    94  1536           0            1          5912       5908       2217                3        0
   100  1632           0            0         11990      11990       4796                2        0
   107  1744           0            1         15771      15768       6759                3        0
   111  1808           0            1         10386      10380       4616                4        0
   126  2048           0            0         45444      45444      22722                1        0
   144  2336           0            0         47446      47446      27112                4        0
   151  2448           1            0         10760      10759       6456                3        0
   168  2720           0            0         10173      10173       6782                2        0
   190  3072           0            1          1700       1697       1275                3        0
   202  3264           0            1           290        286        232                4        0
   254  4096           0            0         34051      34051      34051                1        0

Order 3 (8 pages per zspage) changed pool characteristics and unmerged
some of the size classes, which resulted in fewer objects being put into
size class 2048, because lower size classes are now available for more
compact object storage:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
    71  1168           0            1          2996       2994        856                2        0
    72  1184           0            1          1632       1609        476                7        0
    73  1200           1            0          1445       1442        425                5        0
    74  1216           0            0          1510       1510        453                3        0
    75  1232           0            1          1495       1479        455                7        0
    76  1248           0            1          1456       1451        448                4        0
    78  1280           0            1          3040       3033        950                5        0
    79  1296           0            1          1584       1571        504                7        0
    83  1360           0            0          6375       6375       2125                1        0
    84  1376           0            1          1817       1796        632                8        0
    87  1424           0            1          6020       6006       2107                7        0
    88  1440           0            1          2108       2101        744                6        0
    89  1456           0            1          2072       2064        740                5        0
    91  1488           0            1          4169       4159       1516                4        0
    92  1504           0            1          2014       2007        742                7        0
    94  1536           0            1          3904       3900       1464                3        0
    95  1552           0            1          1890       1873        720                8        0
    96  1568           0            1          1963       1958        755                5        0
    97  1584           0            1          1980       1974        770                7        0
   100  1632           0            1          6190       6187       2476                2        0
   103  1680           0            0          6477       6477       2667                7        0
   104  1696           0            1          2256       2253        940                5        0
   105  1712           0            1          2356       2340        992                8        0
   107  1744           1            0          4697       4696       2013                3        0
   110  1792           0            1          7744       7734       3388                7        0
   111  1808           0            1          2655       2649       1180                4        0
   114  1856           0            1          8371       8365       3805                5        0
   116  1888           1            0          5863       5862       2706                6        0
   117  1904           0            1          2955       2942       1379                7        0
   118  1920           0            1          3009       2997       1416                8        0
   126  2048           0            0         25276      25276      12638                1        0
   128  2080           0            1          6060       6052       3232                8        0
   129  2096           1            0          3081       3080       1659                7        0
   134  2176           0            1         14835      14830       7912                8        0
   135  2192           0            1          2769       2758       1491                7        0
   137  2224           0            1          5082       5077       2772                6        0
   140  2272           0            1          7236       7232       4020                5        0
   144  2336           0            1          8428       8423       4816                4        0
   147  2384           0            1          5316       5313       3101                7        0
   151  2448           0            1          5445       5443       3267                3        0
   155  2512           0            0          4121       4121       2536                8        0
   158  2560           0            1          2208       2205       1380                5        0
   160  2592           0            0          1133       1133        721                7        0
   168  2720           0            0          2712       2712       1808                2        0
   177  2864           1            0          1100       1098        770                7        0
   180  2912           0            1           189        183        135                5        0
   184  2976           0            1           176        166        128                8        0
   190  3072           0            0           252        252        189                3        0
   197  3184           0            1           198        192        154                7        0
   202  3264           0            1           100         96         80                4        0
   211  3408           0            1           210        208        175                5        0
   217  3504           0            1            98         94         84                6        0
   222  3584           0            0           104        104         91                7        0
   225  3632           0            1            54         50         48                8        0
   254  4096           0            0         33591      33591      33591                1        0

Note that the huge class watermark is now 3632 bytes and there are a number
of new normal classes available that previously were merged with the huge
class. For instance, size class #211 holds 210 objects of size 3408 and uses
175 physical pages, while previously we would have used 210 physical pages
for those objects.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sergey Senozhatsky <[email protected]>
Cc: Alexey Romanov <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Nitin Gupta <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Oct 29, 2022
zsmalloc has 255 size classes. Size classes contain a number of zspages,
which store objects of the same size. A zspage can consist of up to four
physical pages. The optimal zspage size is calculated for each size class
during zsmalloc pool creation.

As a reasonable optimization, zsmalloc merges size classes that have
similar characteristics: number of pages per zspage and number of objects
a zspage can store.

For example, let's look at the following size classes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
..
   94  1536           0            0             0          0          0                3        0
  100  1632           0            0             0          0          0                2        0
..

Size classes #95-99 are merged with size class #100. That is, each time
we store an object of size, say, 1568 bytes, instead of using class #96 we
end up storing it in size class #100. Class #100 is for objects of 1632
bytes in size, hence every 1568-byte object wastes 1632-1568 = 64 bytes.
Class #100 zspages consist of 2 physical pages and can hold 5 objects.
When we need to store, say, 13 objects of size 1568 we end up allocating
three zspages; in other words, 6 physical pages.

However, if we look closer at size class #96 (which should hold objects
of size 1568 bytes) and trace get_pages_per_zspage():

    pages per zspage      wasted bytes     used%
           1                  960           76
           2                  352           95
           3                 1312           89
           4                  704           95
           5                   96           99

We'd notice that the optimal zspage configuration for this class is
5 physical pages, but currently we never let zspages consist of more
than 4 pages. A 5-page class #96 configuration would store 13 objects of
size 1568 in a single zspage, allocating 5 physical pages, as opposed to
the 6 physical pages that class #100 allocates.
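
The 6-versus-5 page comparison can be checked with a few lines of arithmetic. This is only a sketch: the helper name `pages_needed` is made up for illustration, and a 4096-byte PAGE_SIZE is assumed.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def pages_needed(num_objects, class_size, pages_per_zspage):
    """Physical pages allocated to store num_objects in one size class.

    Hypothetical helper: it only models whole-zspage allocation and
    ignores any per-object or per-zspage metadata.
    """
    objs_per_zspage = pages_per_zspage * PAGE_SIZE // class_size
    zspages = -(-num_objects // objs_per_zspage)  # ceiling division
    return zspages * pages_per_zspage


# 13 objects of 1568 bytes stored in merged class #100 (1632 bytes, 2 pages)
merged = pages_needed(13, 1632, 2)
# The same objects in a dedicated class #96 (1568 bytes, 5 pages)
dedicated = pages_needed(13, 1568, 5)
print(merged, dedicated)
```

Class #100 needs three 2-page zspages, while a 5-page class #96 zspage holds all 13 objects at once.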

A higher order zspage for class #96 also changes its key characteristics:
pages per zspage and objects per zspage. As a result, classes #96 and #100
are not merged anymore, which gives us a more compact zsmalloc.

Of course the described effect does not apply only to size classes #96 and
#100. We still merge classes, but less often. In other words, classes are
grouped in a more compact way, which decreases memory wastage:

zspage order               # unique size classes
     2                                69
     3                               123
     4                               191
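
The merging rule described above can be modelled roughly as follows. This is a sketch under stated assumptions, not the kernel code: 4096-byte pages, a 32-byte minimum class size, a 16-byte step between classes, and a merge whenever a class has the same pages per zspage and objects per zspage as the next distinct larger class. All names are illustrative.

```python
PAGE_SIZE = 4096
MIN_ALLOC_SIZE = 32   # assumed smallest class size
CLASS_DELTA = 16      # assumed step between neighbouring classes
NUM_CLASSES = 255


def pages_per_zspage(class_size, max_pages):
    """Smallest page count with the best integer used% (assumed tie-break)."""
    best_pages, best_pct = 1, 0
    for pages in range(1, max_pages + 1):
        zspage_size = pages * PAGE_SIZE
        waste = zspage_size % class_size
        used_pct = (zspage_size - waste) * 100 // zspage_size
        if used_pct > best_pct:
            best_pages, best_pct = pages, used_pct
    return best_pages


def build_class_map(max_pages):
    """Map every class index to the class that actually stores its objects."""
    class_map = {}
    prev = None  # (index, pages, objects) of the last unmerged class
    for idx in range(NUM_CLASSES - 1, -1, -1):
        size = min(MIN_ALLOC_SIZE + idx * CLASS_DELTA, PAGE_SIZE)
        pages = pages_per_zspage(size, max_pages)
        objs = pages * PAGE_SIZE // size
        if prev is not None and (prev[1], prev[2]) == (pages, objs):
            class_map[idx] = prev[0]  # same characteristics: merge upward
            continue
        class_map[idx] = idx
        prev = (idx, pages, objs)
    return class_map


for max_pages in (4, 8, 16):
    unique = len(set(build_class_map(max_pages).values()))
    print(max_pages, unique)
```

If the model matches the allocator, the printed counts should line up with the table above; that match has not been verified here, so treat the counts as a sanity check rather than a reference.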

Let's take a closer look at the bottom of /sys/kernel/debug/zsmalloc/zram0/classes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  254  4096           0            0             0          0          0                1        0
...

For exactly the same reason - maximum 4 pages per zspage - the last non-huge
size class is #202, which stores objects of size 3264 bytes. Any object
larger than 3264 bytes, hence, is considered to be huge and lands in size
class #254, which uses a whole physical page to store every object. To put
it slightly differently - objects in huge classes don't share physical pages.

3264 bytes is too low a watermark and we have too many huge classes:
classes from #203 to #254. Similarly to size class #96 above, higher order
zspages change key characteristics for some of those huge size classes and
thus those classes become normal classes, where stored objects share physical
pages.

Hence yet another consequence of higher order zspages: the huge size class
watermark moves up, we have fewer huge classes and store large objects in
a more compact way.
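
One way to see where the watermark lands: a class stops being huge as soon as the largest permitted zspage can hold more objects than it has pages. The sketch below applies that reasoning directly instead of running the allocator's own search. It assumes 4096-byte pages and class sizes in 16-byte steps; the function name is hypothetical.

```python
PAGE_SIZE = 4096
CLASS_DELTA = 16  # assumed step between neighbouring class sizes


def huge_class_watermark(max_pages):
    """Largest class size for which max_pages pages hold more than
    max_pages objects, so stored objects end up sharing physical pages."""
    zspage_size = max_pages * PAGE_SIZE
    size = PAGE_SIZE
    while size > 0:
        if zspage_size // size > max_pages:
            return size
        size -= CLASS_DELTA
    return None


for order in (2, 3, 4):
    max_pages = 1 << order
    print(order, max_pages, huge_class_watermark(max_pages))
```

Only the largest zspage needs checking, because k pages fit k+1 objects of size s only when s <= 4096 * k / (k + 1), and that bound grows with k.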

For order 3, huge class watermark becomes 3632 bytes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  211  3408           0            0             0          0          0                5        0
  217  3504           0            0             0          0          0                6        0
  222  3584           0            0             0          0          0                7        0
  225  3632           0            0             0          0          0                8        0
  254  4096           0            0             0          0          0                1        0
...

For order 4, huge class watermark becomes 3840 bytes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  206  3328           0            0             0          0          0               13        0
  207  3344           0            0             0          0          0                9        0
  208  3360           0            0             0          0          0               14        0
  211  3408           0            0             0          0          0                5        0
  212  3424           0            0             0          0          0               16        0
  214  3456           0            0             0          0          0               11        0
  217  3504           0            0             0          0          0                6        0
  219  3536           0            0             0          0          0               13        0
  222  3584           0            0             0          0          0                7        0
  223  3600           0            0             0          0          0               15        0
  225  3632           0            0             0          0          0                8        0
  228  3680           0            0             0          0          0                9        0
  230  3712           0            0             0          0          0               10        0
  232  3744           0            0             0          0          0               11        0
  234  3776           0            0             0          0          0               12        0
  235  3792           0            0             0          0          0               13        0
  236  3808           0            0             0          0          0               14        0
  238  3840           0            0             0          0          0               15        0
  254  4096           0            0             0          0          0                1        0
...

TESTS
=====

Test untars linux-6.0.tar.xz and compiles the kernel.

zram is configured as a block device with ext4 file system, lzo-rle
compression algorithm. We captured /sys/block/zram0/mm_stat after
every test and rebooted the VM.

orig_data_size       mem_used_total     mem_used_max       pages_compacted
          compr_data_size         mem_limit         same_pages       huge_pages

ORDER 2 (BASE) zspage

1691791360 628086729 655171584        0 655171584       60        0    34043
1691787264 628089196 655175680        0 655175680       60        0    34046
1691803648 628098840 655187968        0 655187968       59        0    34047
1691795456 628091503 655183872        0 655183872       60        0    34044
1691799552 628086877 655183872        0 655183872       60        0    34047

ORDER 3 zspage

1691803648 627792993 641794048        0 641794048       60        0    33591
1691787264 627779342 641708032        0 641708032       59        0    33591
1691811840 627786616 641769472        0 641769472       60        0    33591
1691803648 627794468 641818624        0 641818624       59        0    33592
1691783168 627780882 641794048        0 641794048       61        0    33591

ORDER 4 zspage

1691803648 627726635 639655936        0 639655936       60        0    33435
1691811840 627733348 639643648        0 639643648       61        0    33434
1691795456 627726290 639614976        0 639614976       60        0    33435
1691803648 627730458 639688704        0 639688704       60        0    33434
1691811840 627727771 639688704        0 639688704       60        0    33434

Order 3 and order 4 show statistically significant improvement in
`mem_used_max` metrics.

T-test for order 3:

x order-2-maxmem
+ order-3-maxmem
    N           Min           Max        Median           Avg        Stddev
x   5 6.5517158e+08 6.5518797e+08 6.5518387e+08  6.551806e+08     6730.4157
+   5 6.4170803e+08 6.4181862e+08 6.4179405e+08 6.4177684e+08     42210.666
Difference at 95.0% confidence
	-1.34038e+07 +/- 44080.7
	-2.04581% +/- 0.00672802%
	(Student's t, pooled s = 30224.5)

T-test for order 4:

x order-2-maxmem
+ order-4-maxmem
    N           Min           Max        Median           Avg        Stddev
x   5 6.5517158e+08 6.5518797e+08 6.5518387e+08  6.551806e+08     6730.4157
+   5 6.3961498e+08  6.396887e+08 6.3965594e+08 6.3965839e+08     31408.602
Difference at 95.0% confidence
	-1.55222e+07 +/- 33126.2
	-2.36915% +/- 0.00505604%
	(Student's t, pooled s = 22713.4)
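
The differences and confidence intervals above can be recomputed from the `mem_used_max` column of the captured mm_stat lines. This is a minimal sketch of a pooled two-sample Student's t interval; the critical value 2.306 (two-sided 95%, 8 degrees of freedom) is hard-coded as an assumption because the Python standard library has no t distribution.

```python
import math
import statistics

T_CRIT_95_DF8 = 2.306  # assumed: two-sided 95% Student's t value for 8 d.o.f.

# mem_used_max values copied from the mm_stat samples above
ORDER2 = [655171584, 655175680, 655187968, 655183872, 655183872]
ORDER3 = [641794048, 641708032, 641769472, 641818624, 641794048]
ORDER4 = [639655936, 639643648, 639614976, 639688704, 639688704]


def pooled_t_interval(base, test, t_crit=T_CRIT_95_DF8):
    """Return (mean difference, CI half-width, pooled stddev)."""
    n1, n2 = len(base), len(test)
    diff = statistics.mean(test) - statistics.mean(base)
    pooled_var = ((n1 - 1) * statistics.variance(base)
                  + (n2 - 1) * statistics.variance(test)) / (n1 + n2 - 2)
    pooled_s = math.sqrt(pooled_var)
    half_width = t_crit * pooled_s * math.sqrt(1 / n1 + 1 / n2)
    return diff, half_width, pooled_s


for name, sample in (("order 3", ORDER3), ("order 4", ORDER4)):
    diff, half_width, pooled_s = pooled_t_interval(ORDER2, sample)
    print(f"{name}: {diff:.6g} +/- {half_width:.6g} (pooled s = {pooled_s:.6g})")
```

The hard-coded critical value is valid only for two samples of five measurements each; other sample sizes need a different value.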

This test tends to benefit more from order 4 zspages, due to the test's
data patterns.

zsmalloc object distribution analysis
=============================================================================

Order 2 (4 pages per zspage) tends to put many objects in size class 2048,
which is merged with size classes #112-#125:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
    71  1168           0            0          6146       6146       1756                2        0
    74  1216           0            1          4560       4552       1368                3        0
    76  1248           0            1          2938       2934        904                4        0
    83  1360           0            0         10971      10971       3657                1        0
    91  1488           0            0         16126      16126       5864                4        0
    94  1536           0            1          5912       5908       2217                3        0
   100  1632           0            0         11990      11990       4796                2        0
   107  1744           0            1         15771      15768       6759                3        0
   111  1808           0            1         10386      10380       4616                4        0
   126  2048           0            0         45444      45444      22722                1        0
   144  2336           0            0         47446      47446      27112                4        0
   151  2448           1            0         10760      10759       6456                3        0
   168  2720           0            0         10173      10173       6782                2        0
   190  3072           0            1          1700       1697       1275                3        0
   202  3264           0            1           290        286        232                4        0
   254  4096           0            0         34051      34051      34051                1        0

Order 3 (8 pages per zspage) changed pool characteristics and unmerged
some of the size classes, which resulted in fewer objects being put into
size class 2048, because lower size classes are now available for more
compact object storage:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
    71  1168           0            1          2996       2994        856                2        0
    72  1184           0            1          1632       1609        476                7        0
    73  1200           1            0          1445       1442        425                5        0
    74  1216           0            0          1510       1510        453                3        0
    75  1232           0            1          1495       1479        455                7        0
    76  1248           0            1          1456       1451        448                4        0
    78  1280           0            1          3040       3033        950                5        0
    79  1296           0            1          1584       1571        504                7        0
    83  1360           0            0          6375       6375       2125                1        0
    84  1376           0            1          1817       1796        632                8        0
    87  1424           0            1          6020       6006       2107                7        0
    88  1440           0            1          2108       2101        744                6        0
    89  1456           0            1          2072       2064        740                5        0
    91  1488           0            1          4169       4159       1516                4        0
    92  1504           0            1          2014       2007        742                7        0
    94  1536           0            1          3904       3900       1464                3        0
    95  1552           0            1          1890       1873        720                8        0
    96  1568           0            1          1963       1958        755                5        0
    97  1584           0            1          1980       1974        770                7        0
   100  1632           0            1          6190       6187       2476                2        0
   103  1680           0            0          6477       6477       2667                7        0
   104  1696           0            1          2256       2253        940                5        0
   105  1712           0            1          2356       2340        992                8        0
   107  1744           1            0          4697       4696       2013                3        0
   110  1792           0            1          7744       7734       3388                7        0
   111  1808           0            1          2655       2649       1180                4        0
   114  1856           0            1          8371       8365       3805                5        0
   116  1888           1            0          5863       5862       2706                6        0
   117  1904           0            1          2955       2942       1379                7        0
   118  1920           0            1          3009       2997       1416                8        0
   126  2048           0            0         25276      25276      12638                1        0
   128  2080           0            1          6060       6052       3232                8        0
   129  2096           1            0          3081       3080       1659                7        0
   134  2176           0            1         14835      14830       7912                8        0
   135  2192           0            1          2769       2758       1491                7        0
   137  2224           0            1          5082       5077       2772                6        0
   140  2272           0            1          7236       7232       4020                5        0
   144  2336           0            1          8428       8423       4816                4        0
   147  2384           0            1          5316       5313       3101                7        0
   151  2448           0            1          5445       5443       3267                3        0
   155  2512           0            0          4121       4121       2536                8        0
   158  2560           0            1          2208       2205       1380                5        0
   160  2592           0            0          1133       1133        721                7        0
   168  2720           0            0          2712       2712       1808                2        0
   177  2864           1            0          1100       1098        770                7        0
   180  2912           0            1           189        183        135                5        0
   184  2976           0            1           176        166        128                8        0
   190  3072           0            0           252        252        189                3        0
   197  3184           0            1           198        192        154                7        0
   202  3264           0            1           100         96         80                4        0
   211  3408           0            1           210        208        175                5        0
   217  3504           0            1            98         94         84                6        0
   222  3584           0            0           104        104         91                7        0
   225  3632           0            1            54         50         48                8        0
   254  4096           0            0         33591      33591      33591                1        0

Note that the huge class watermark is now 3632 bytes and there are a number
of new normal classes available that previously were merged with the huge
class. For instance, size class #211 holds 210 objects of size 3408 and uses
175 physical pages, while previously we would have used 210 physical pages
for those objects.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sergey Senozhatsky <[email protected]>
Cc: Alexey Romanov <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Nitin Gupta <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Nov 1, 2022
zsmalloc has 255 size classes. Size classes contain a number of zspages,
which store objects of the same size. A zspage can consist of up to four
physical pages. The optimal zspage size is calculated for each size class
during zsmalloc pool creation.

As a reasonable optimization, zsmalloc merges size classes that have
similar characteristics: number of pages per zspage and number of objects
a zspage can store.

For example, let's look at the following size classes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
..
   94  1536           0            0             0          0          0                3        0
  100  1632           0            0             0          0          0                2        0
..

Size classes #95-99 are merged with size class #100. That is, each time
we store an object of size, say, 1568 bytes, instead of using class #96
we end up storing it in size class #100. Class #100 is for objects of
1632 bytes in size, hence every 1568-byte object wastes 1632-1568 = 64 bytes.
Class #100 zspages consist of 2 physical pages and can hold 5 objects.
When we need to store, say, 13 objects of size 1568 we end up allocating
three zspages; in other words, 6 physical pages.

However, if we look closer at size class #96 (which should hold objects
of size 1568 bytes) and trace get_pages_per_zspage():

    pages per zspage      wasted bytes     used%
           1                  960           76
           2                  352           95
           3                 1312           89
           4                  704           95
           5                   96           99
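
The trace above can be reproduced with plain arithmetic. What follows is a minimal Python sketch rather than the kernel code: it assumes a 4096-byte page, truncating integer percentages, and a made-up helper name (trace_zspage_configs).

```python
PAGE_SIZE = 4096  # assumption: 4K pages, as the tables in this message imply


def trace_zspage_configs(class_size, max_pages):
    """Return (pages, wasted_bytes, used_percent) for 1..max_pages pages."""
    rows = []
    for pages in range(1, max_pages + 1):
        zspage_size = pages * PAGE_SIZE
        wasted = zspage_size % class_size  # tail too small for one more object
        used_pc = (zspage_size - wasted) * 100 // zspage_size
        rows.append((pages, wasted, used_pc))
    return rows


for pages, wasted, used_pc in trace_zspage_configs(1568, 5):
    print(f"{pages:>12} {wasted:>20} {used_pc:>12}")
```

The wasted bytes are simply the remainder of the zspage size divided by the object size, which is why 5 pages (20480 bytes, 13 objects) leaves only 96 bytes unused.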

We'd notice that the optimal zspage configuration for this class is
when it consists of 5 physical pages, but currently we never let zspages
consist of more than 4 pages. A 5 page class #96 configuration would
store 13 objects of size 1568 in a single zspage, allocating 5 physical
pages, as opposed to 6 physical pages that class #100 will allocate.
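
The 6-versus-5 page comparison can be checked with a short calculation. This is an illustrative Python sketch, assuming 4096-byte pages; pages_needed is a hypothetical helper, not a zsmalloc function.

```python
import math

PAGE_SIZE = 4096  # assumption: 4K pages


def pages_needed(nr_objects, class_size, pages_per_zspage):
    """Physical pages allocated to store nr_objects in the given class."""
    objs_per_zspage = pages_per_zspage * PAGE_SIZE // class_size
    zspages = math.ceil(nr_objects / objs_per_zspage)
    return zspages * pages_per_zspage


# 13 objects of 1568 bytes stored in the merged 1632-byte class (2 pages per zspage)
print(pages_needed(13, 1632, 2))
# the same objects in a dedicated 1568-byte class with 5 pages per zspage
print(pages_needed(13, 1568, 5))
```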

A higher order zspage for class #96 also changes its key characteristics:
pages per-zspage and objects per-zspage. As a result classes #96 and #100
are not merged anymore, which gives us a more compact zsmalloc.
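
To see why a larger page limit unmerges the two classes, here is a simplified Python model of the merging rule described above. It is a sketch under assumptions, not the kernel implementation: 4096-byte pages, class i holding objects of 32 + 16*i bytes (consistent with the class/size columns in the tables), and a class merging into the class just above it when both have the same pages per zspage and objects per zspage. All function names are made up for illustration.

```python
PAGE_SIZE = 4096                           # assumption: 4K pages
MIN_SIZE, DELTA, NR_CLASSES = 32, 16, 255  # assumed: class i holds 32 + 16*i bytes


def pages_per_zspage(size, max_pages):
    """Pick the smallest page count that gives the best used percentage."""
    best_pages, best_pc = 1, 0
    for pages in range(1, max_pages + 1):
        total = pages * PAGE_SIZE
        used_pc = (total - total % size) * 100 // total
        if used_pc > best_pc:
            best_pages, best_pc = pages, used_pc
    return best_pages


def merge_classes(max_pages):
    """Map every class index to the class that actually stores its objects."""
    stored_in = {}
    prev = None  # (index, pages, objects) of the nearest unmerged class above
    for idx in reversed(range(NR_CLASSES)):
        size = min(MIN_SIZE + idx * DELTA, PAGE_SIZE)
        pages = pages_per_zspage(size, max_pages)
        objs = pages * PAGE_SIZE // size
        if prev and prev[1:] == (pages, objs):
            stored_in[idx] = prev[0]  # same characteristics: merge upwards
        else:
            prev = (idx, pages, objs)
            stored_in[idx] = idx
    return stored_in


print(merge_classes(4)[96])  # class that stores 1568-byte objects, 4-page limit
print(merge_classes(8)[96])  # the same with an 8-page limit
```

With a 4-page limit, classes #95-99 and #100 all end up with 2 pages and 5 objects per zspage, so they collapse into one class; with an 8-page limit class #96 gets 5 pages and 13 objects, which no neighbouring class shares.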

Of course the described effect does not apply only to size classes #96 and
#100. We still merge classes, but less often so. In other words classes are
grouped in a more compact way, which decreases memory wastage:

zspage order               # unique size classes
     2                                69
     3                               123
     4                               191

Let's take a closer look at the bottom of /sys/kernel/debug/zsmalloc/zram0/classes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  254  4096           0            0             0          0          0                1        0
...

For exactly the same reason - maximum 4 pages per zspage - the last non-huge
size class is #202, which stores objects of size 3264 bytes. Any object
larger than 3264 bytes, hence, is considered to be huge and lands in size
class #254, which uses a whole physical page to store every object. To put
it slightly differently - objects in huge classes don't share physical pages.

3264 bytes is too low of a watermark and we have too many huge classes:
classes from #203 to #254. Similarly to size class #96 above, higher order
zspages change key characteristics for some of those huge size classes and
thus those classes become normal classes, where stored objects share physical
pages.

Hence yet another consequence of higher order zspages: we move the huge
size class watermark with higher order zspages, have fewer huge classes and
store large objects in a more compact way.

For order 3, huge class watermark becomes 3632 bytes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  211  3408           0            0             0          0          0                5        0
  217  3504           0            0             0          0          0                6        0
  222  3584           0            0             0          0          0                7        0
  225  3632           0            0             0          0          0                8        0
  254  4096           0            0             0          0          0                1        0
...

For order 4, huge class watermark becomes 3840 bytes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  206  3328           0            0             0          0          0               13        0
  207  3344           0            0             0          0          0                9        0
  208  3360           0            0             0          0          0               14        0
  211  3408           0            0             0          0          0                5        0
  212  3424           0            0             0          0          0               16        0
  214  3456           0            0             0          0          0               11        0
  217  3504           0            0             0          0          0                6        0
  219  3536           0            0             0          0          0               13        0
  222  3584           0            0             0          0          0                7        0
  223  3600           0            0             0          0          0               15        0
  225  3632           0            0             0          0          0                8        0
  228  3680           0            0             0          0          0                9        0
  230  3712           0            0             0          0          0               10        0
  232  3744           0            0             0          0          0               11        0
  234  3776           0            0             0          0          0               12        0
  235  3792           0            0             0          0          0               13        0
  236  3808           0            0             0          0          0               14        0
  238  3840           0            0             0          0          0               15        0
  254  4096           0            0             0          0          0                1        0
...
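
The three watermarks (3264, 3632 and 3840 bytes) follow from the page limit alone. The Python sketch below is a simplified model, not the kernel code: it assumes 4096-byte pages, class sizes in 16-byte steps, and treats a class as huge when the best zspage configuration holds a single object. The function names are hypothetical.

```python
PAGE_SIZE = 4096  # assumption: 4K pages
DELTA = 16        # assumption: class sizes are 16 bytes apart


def pages_per_zspage(size, max_pages):
    """Pick the smallest page count that gives the best used percentage."""
    best_pages, best_pc = 1, 0
    for pages in range(1, max_pages + 1):
        total = pages * PAGE_SIZE
        used_pc = (total - total % size) * 100 // total
        if used_pc > best_pc:
            best_pages, best_pc = pages, used_pc
    return best_pages


def huge_watermark(max_pages):
    """Largest class size whose zspage still holds more than one object."""
    for size in range(PAGE_SIZE, 0, -DELTA):
        pages = pages_per_zspage(size, max_pages)
        if pages * PAGE_SIZE // size > 1:
            return size
    return None


for order in (2, 3, 4):
    print(order, huge_watermark(1 << order))
```

An object size can share pages only if n pages fit n + 1 objects for some n within the limit, so raising the limit from 4 to 8 to 16 pages moves the cut-off upwards.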

TESTS
=====

Test untars linux-6.0.tar.xz and compiles the kernel.

zram is configured as a block device with ext4 file system, lzo-rle
compression algorithm. We captured /sys/block/zram0/mm_stat after
every test and rebooted the VM.

orig_data_size       mem_used_total     mem_used_max       pages_compacted
          compr_data_size         mem_limit         same_pages       huge_pages

ORDER 2 (BASE) zspage

1691791360 628086729 655171584        0 655171584       60        0    34043
1691787264 628089196 655175680        0 655175680       60        0    34046
1691803648 628098840 655187968        0 655187968       59        0    34047
1691795456 628091503 655183872        0 655183872       60        0    34044
1691799552 628086877 655183872        0 655183872       60        0    34047

ORDER 3 zspage

1691803648 627792993 641794048        0 641794048       60        0    33591
1691787264 627779342 641708032        0 641708032       59        0    33591
1691811840 627786616 641769472        0 641769472       60        0    33591
1691803648 627794468 641818624        0 641818624       59        0    33592
1691783168 627780882 641794048        0 641794048       61        0    33591

ORDER 4 zspage

1691803648 627726635 639655936        0 639655936       60        0    33435
1691811840 627733348 639643648        0 639643648       61        0    33434
1691795456 627726290 639614976        0 639614976       60        0    33435
1691803648 627730458 639688704        0 639688704       60        0    33434
1691811840 627727771 639688704        0 639688704       60        0    33434

Order 3 and order 4 show statistically significant improvement in
`mem_used_max` metrics.

T-test for order 3:

x order-2-maxmem
+ order-3-maxmem
    N           Min           Max        Median           Avg        Stddev
x   5 6.5517158e+08 6.5518797e+08 6.5518387e+08  6.551806e+08     6730.4157
+   5 6.4170803e+08 6.4181862e+08 6.4179405e+08 6.4177684e+08     42210.666
Difference at 95.0% confidence
	-1.34038e+07 +/- 44080.7
	-2.04581% +/- 0.00672802%
	(Student's t, pooled s = 30224.5)

T-test for order 4:

x order-2-maxmem
+ order-4-maxmem
    N           Min           Max        Median           Avg        Stddev
x   5 6.5517158e+08 6.5518797e+08 6.5518387e+08  6.551806e+08     6730.4157
+   5 6.3961498e+08  6.396887e+08 6.3965594e+08 6.3965839e+08     31408.602
Difference at 95.0% confidence
	-1.55222e+07 +/- 33126.2
	-2.36915% +/- 0.00505604%
	(Student's t, pooled s = 22713.4)
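
The differences reported above can be recomputed from the mem_used_max column of the mm_stat captures. This is a minimal Python sketch, not the tool that produced the output; the 2.306 critical value (two-sided 95% Student's t for 8 degrees of freedom) is hard-coded as an assumption and only fits two samples of five runs each.

```python
from math import sqrt
from statistics import mean, stdev

# mem_used_max (fifth mm_stat column) for each run
order2 = [655171584, 655175680, 655187968, 655183872, 655183872]
order3 = [641794048, 641708032, 641769472, 641818624, 641794048]
order4 = [639655936, 639643648, 639614976, 639688704, 639688704]

T_CRIT_8DF = 2.306  # assumption: 95% two-sided quantile, 5 + 5 - 2 = 8 d.o.f.


def compare(base, new):
    """Return (difference of means, 95% margin, difference in %, pooled s)."""
    n1, n2 = len(base), len(new)
    diff = mean(new) - mean(base)
    pooled = sqrt(((n1 - 1) * stdev(base) ** 2 + (n2 - 1) * stdev(new) ** 2)
                  / (n1 + n2 - 2))
    margin = T_CRIT_8DF * pooled * sqrt(1 / n1 + 1 / n2)
    return diff, margin, 100 * diff / mean(base), pooled


for name, sample in (("order 3", order3), ("order 4", order4)):
    diff, margin, pct, pooled = compare(order2, sample)
    print(f"{name}: {diff:.6g} +/- {margin:.6g} ({pct:.5f}%), pooled s = {pooled:.6g}")
```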

This test tends to benefit more from order 4 zspages, due to test's data
patterns.

zsmalloc object distribution analysis
=============================================================================

Order 2 (4 pages per zspage) tends to put many objects in size class 2048,
which is merged with size classes #112-#125:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
    71  1168           0            0          6146       6146       1756                2        0
    74  1216           0            1          4560       4552       1368                3        0
    76  1248           0            1          2938       2934        904                4        0
    83  1360           0            0         10971      10971       3657                1        0
    91  1488           0            0         16126      16126       5864                4        0
    94  1536           0            1          5912       5908       2217                3        0
   100  1632           0            0         11990      11990       4796                2        0
   107  1744           0            1         15771      15768       6759                3        0
   111  1808           0            1         10386      10380       4616                4        0
   126  2048           0            0         45444      45444      22722                1        0
   144  2336           0            0         47446      47446      27112                4        0
   151  2448           1            0         10760      10759       6456                3        0
   168  2720           0            0         10173      10173       6782                2        0
   190  3072           0            1          1700       1697       1275                3        0
   202  3264           0            1           290        286        232                4        0
   254  4096           0            0         34051      34051      34051                1        0

Order 3 (8 pages per zspage) changed pool characteristics and unmerged
some of the size classes, which resulted in fewer objects being put into
size class 2048, because lower size classes are now available
for more compact object storage:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
    71  1168           0            1          2996       2994        856                2        0
    72  1184           0            1          1632       1609        476                7        0
    73  1200           1            0          1445       1442        425                5        0
    74  1216           0            0          1510       1510        453                3        0
    75  1232           0            1          1495       1479        455                7        0
    76  1248           0            1          1456       1451        448                4        0
    78  1280           0            1          3040       3033        950                5        0
    79  1296           0            1          1584       1571        504                7        0
    83  1360           0            0          6375       6375       2125                1        0
    84  1376           0            1          1817       1796        632                8        0
    87  1424           0            1          6020       6006       2107                7        0
    88  1440           0            1          2108       2101        744                6        0
    89  1456           0            1          2072       2064        740                5        0
    91  1488           0            1          4169       4159       1516                4        0
    92  1504           0            1          2014       2007        742                7        0
    94  1536           0            1          3904       3900       1464                3        0
    95  1552           0            1          1890       1873        720                8        0
    96  1568           0            1          1963       1958        755                5        0
    97  1584           0            1          1980       1974        770                7        0
   100  1632           0            1          6190       6187       2476                2        0
   103  1680           0            0          6477       6477       2667                7        0
   104  1696           0            1          2256       2253        940                5        0
   105  1712           0            1          2356       2340        992                8        0
   107  1744           1            0          4697       4696       2013                3        0
   110  1792           0            1          7744       7734       3388                7        0
   111  1808           0            1          2655       2649       1180                4        0
   114  1856           0            1          8371       8365       3805                5        0
   116  1888           1            0          5863       5862       2706                6        0
   117  1904           0            1          2955       2942       1379                7        0
   118  1920           0            1          3009       2997       1416                8        0
   126  2048           0            0         25276      25276      12638                1        0
   128  2080           0            1          6060       6052       3232                8        0
   129  2096           1            0          3081       3080       1659                7        0
   134  2176           0            1         14835      14830       7912                8        0
   135  2192           0            1          2769       2758       1491                7        0
   137  2224           0            1          5082       5077       2772                6        0
   140  2272           0            1          7236       7232       4020                5        0
   144  2336           0            1          8428       8423       4816                4        0
   147  2384           0            1          5316       5313       3101                7        0
   151  2448           0            1          5445       5443       3267                3        0
   155  2512           0            0          4121       4121       2536                8        0
   158  2560           0            1          2208       2205       1380                5        0
   160  2592           0            0          1133       1133        721                7        0
   168  2720           0            0          2712       2712       1808                2        0
   177  2864           1            0          1100       1098        770                7        0
   180  2912           0            1           189        183        135                5        0
   184  2976           0            1           176        166        128                8        0
   190  3072           0            0           252        252        189                3        0
   197  3184           0            1           198        192        154                7        0
   202  3264           0            1           100         96         80                4        0
   211  3408           0            1           210        208        175                5        0
   217  3504           0            1            98         94         84                6        0
   222  3584           0            0           104        104         91                7        0
   225  3632           0            1            54         50         48                8        0
   254  4096           0            0         33591      33591      33591                1        0

Note, the huge size watermark is above 3632 and there are a number of new
normal classes available that previously were merged with the huge class.
For instance, size class #211 holds 210 objects of size 3408 and uses 175
physical pages, while previously for those objects we would have used 210
physical pages.
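
The 175-versus-210 page figure can be derived from the class #211 row itself. The sketch below is illustrative Python, assuming 4096-byte pages and the column order shown in the table headers; parse_class_row is a hypothetical helper.

```python
PAGE_SIZE = 4096  # assumption: 4K pages
FIELDS = ("class", "size", "almost_full", "almost_empty", "obj_allocated",
          "obj_used", "pages_used", "pages_per_zspage", "freeable")


def parse_class_row(line):
    """Turn one row of the classes table into a dict of integers."""
    return dict(zip(FIELDS, map(int, line.split())))


row = parse_class_row(
    "   211  3408           0            1           210        208"
    "        175                5        0")

objs_per_zspage = row["pages_per_zspage"] * PAGE_SIZE // row["size"]
zspages = row["obj_allocated"] // objs_per_zspage
normal_pages = zspages * row["pages_per_zspage"]
huge_pages = row["obj_allocated"]  # a huge class spends one page per object

print(objs_per_zspage, zspages, normal_pages, huge_pages)
```

Each 5-page zspage holds 6 objects of 3408 bytes, so the 210 allocated object slots need 35 zspages.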

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sergey Senozhatsky <[email protected]>
Cc: Alexey Romanov <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Nitin Gupta <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Nov 1, 2022
zsmalloc has 255 size classes.  Size classes contain a number of zspages,
which store objects of the same size.  A zspage can consist of up to four
physical pages.  The exact (optimal) zspage size is calculated for
each size class during zsmalloc pool creation.

As a reasonable optimization, zsmalloc merges size classes that have
similar characteristics: number of pages per zspage and number of objects
zspage can store.

For example, let's look at the following size classes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
..
   94  1536           0            0             0          0          0                3        0
  100  1632           0            0             0          0          0                2        0
..

Size classes #95-99 are merged with size class #100. That is, each time
we store an object of size, say, 1568 bytes, instead of using class #96
we end up storing it in size class #100. Class #100 is for objects of
1632 bytes in size, hence every 1568-byte object wastes 1632-1568 = 64 bytes.
Class #100 zspages consist of 2 physical pages and can hold 5 objects.
When we need to store, say, 13 objects of size 1568 we end up allocating
three zspages; in other words, 6 physical pages.

However, if we look closer at size class #96 (which should hold objects
of size 1568 bytes) and trace get_pages_per_zspage():

    pages per zspage      wasted bytes     used%
           1                  960           76
           2                  352           95
           3                 1312           89
           4                  704           95
           5                   96           99

We'd notice that the optimal zspage configuration for this class is
when it consists of 5 physical pages, but currently we never let zspages
consist of more than 4 pages. A 5 page class #96 configuration would
store 13 objects of size 1568 in a single zspage, allocating 5 physical
pages, as opposed to 6 physical pages that class #100 will allocate.

A higher order zspage for class #96 also changes its key characteristics:
pages per-zspage and objects per-zspage. As a result classes #96 and #100
are not merged anymore, which gives us a more compact zsmalloc.

Of course the described effect does not apply only to size classes #96 and
#100. We still merge classes, but less often so. In other words classes are
grouped in a more compact way, which decreases memory wastage:

zspage order               # unique size classes
     2                                69
     3                               123
     4                               191

Let's take a closer look at the bottom of /sys/kernel/debug/zsmalloc/zram0/classes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  254  4096           0            0             0          0          0                1        0
...

For exactly the same reason - maximum 4 pages per zspage - the last non-huge
size class is #202, which stores objects of size 3264 bytes. Any object
larger than 3264 bytes, hence, is considered to be huge and lands in size
class #254, which uses a whole physical page to store every object. To put
it slightly differently - objects in huge classes don't share physical pages.

3264 bytes is too low of a watermark and we have too many huge classes:
classes from #203 to #254. Similarly to size class #96 above, higher order
zspages change key characteristics for some of those huge size classes and
thus those classes become normal classes, where stored objects share physical
pages.

Hence yet another consequence of higher order zspages: we move the huge
size class watermark with higher order zspages, have fewer huge classes and
store large objects in a more compact way.

For order 3, huge class watermark becomes 3632 bytes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  211  3408           0            0             0          0          0                5        0
  217  3504           0            0             0          0          0                6        0
  222  3584           0            0             0          0          0                7        0
  225  3632           0            0             0          0          0                8        0
  254  4096           0            0             0          0          0                1        0
...

For order 4, huge class watermark becomes 3840 bytes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  206  3328           0            0             0          0          0               13        0
  207  3344           0            0             0          0          0                9        0
  208  3360           0            0             0          0          0               14        0
  211  3408           0            0             0          0          0                5        0
  212  3424           0            0             0          0          0               16        0
  214  3456           0            0             0          0          0               11        0
  217  3504           0            0             0          0          0                6        0
  219  3536           0            0             0          0          0               13        0
  222  3584           0            0             0          0          0                7        0
  223  3600           0            0             0          0          0               15        0
  225  3632           0            0             0          0          0                8        0
  228  3680           0            0             0          0          0                9        0
  230  3712           0            0             0          0          0               10        0
  232  3744           0            0             0          0          0               11        0
  234  3776           0            0             0          0          0               12        0
  235  3792           0            0             0          0          0               13        0
  236  3808           0            0             0          0          0               14        0
  238  3840           0            0             0          0          0               15        0
  254  4096           0            0             0          0          0                1        0
...

TESTS
=====

Test untars linux-6.0.tar.xz and compiles the kernel.

zram is configured as a block device with ext4 file system, lzo-rle
compression algorithm. We captured /sys/block/zram0/mm_stat after
every test and rebooted the VM.

orig_data_size       mem_used_total     mem_used_max       pages_compacted
          compr_data_size         mem_limit         same_pages       huge_pages

ORDER 2 (BASE) zspage

1691791360 628086729 655171584        0 655171584       60        0    34043
1691787264 628089196 655175680        0 655175680       60        0    34046
1691803648 628098840 655187968        0 655187968       59        0    34047
1691795456 628091503 655183872        0 655183872       60        0    34044
1691799552 628086877 655183872        0 655183872       60        0    34047

ORDER 3 zspage

1691803648 627792993 641794048        0 641794048       60        0    33591
1691787264 627779342 641708032        0 641708032       59        0    33591
1691811840 627786616 641769472        0 641769472       60        0    33591
1691803648 627794468 641818624        0 641818624       59        0    33592
1691783168 627780882 641794048        0 641794048       61        0    33591

ORDER 4 zspage

1691803648 627726635 639655936        0 639655936       60        0    33435
1691811840 627733348 639643648        0 639643648       61        0    33434
1691795456 627726290 639614976        0 639614976       60        0    33435
1691803648 627730458 639688704        0 639688704       60        0    33434
1691811840 627727771 639688704        0 639688704       60        0    33434

Order 3 and order 4 show statistically significant improvement in
`mem_used_max` metrics.

T-test for order 3:

x order-2-maxmem
+ order-3-maxmem
    N           Min           Max        Median           Avg        Stddev
x   5 6.5517158e+08 6.5518797e+08 6.5518387e+08  6.551806e+08     6730.4157
+   5 6.4170803e+08 6.4181862e+08 6.4179405e+08 6.4177684e+08     42210.666
Difference at 95.0% confidence
	-1.34038e+07 +/- 44080.7
	-2.04581% +/- 0.00672802%
	(Student's t, pooled s = 30224.5)

T-test for order 4:

x order-2-maxmem
+ order-4-maxmem
    N           Min           Max        Median           Avg        Stddev
x   5 6.5517158e+08 6.5518797e+08 6.5518387e+08  6.551806e+08     6730.4157
+   5 6.3961498e+08  6.396887e+08 6.3965594e+08 6.3965839e+08     31408.602
Difference at 95.0% confidence
	-1.55222e+07 +/- 33126.2
	-2.36915% +/- 0.00505604%
	(Student's t, pooled s = 22713.4)

This test tends to benefit more from order 4 zspages, due to test's data
patterns.

zsmalloc object distribution analysis
=============================================================================

Order 2 (4 pages per zspage) tends to put many objects in size class 2048,
which is merged with size classes #112-#125:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
    71  1168           0            0          6146       6146       1756                2        0
    74  1216           0            1          4560       4552       1368                3        0
    76  1248           0            1          2938       2934        904                4        0
    83  1360           0            0         10971      10971       3657                1        0
    91  1488           0            0         16126      16126       5864                4        0
    94  1536           0            1          5912       5908       2217                3        0
   100  1632           0            0         11990      11990       4796                2        0
   107  1744           0            1         15771      15768       6759                3        0
   111  1808           0            1         10386      10380       4616                4        0
   126  2048           0            0         45444      45444      22722                1        0
   144  2336           0            0         47446      47446      27112                4        0
   151  2448           1            0         10760      10759       6456                3        0
   168  2720           0            0         10173      10173       6782                2        0
   190  3072           0            1          1700       1697       1275                3        0
   202  3264           0            1           290        286        232                4        0
   254  4096           0            0         34051      34051      34051                1        0

Order 3 (8 pages per zspage) changed pool characteristics and unmerged
some of the size classes, which resulted in fewer objects being put into
size class 2048, because lower size classes are now available
for more compact object storage:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
    71  1168           0            1          2996       2994        856                2        0
    72  1184           0            1          1632       1609        476                7        0
    73  1200           1            0          1445       1442        425                5        0
    74  1216           0            0          1510       1510        453                3        0
    75  1232           0            1          1495       1479        455                7        0
    76  1248           0            1          1456       1451        448                4        0
    78  1280           0            1          3040       3033        950                5        0
    79  1296           0            1          1584       1571        504                7        0
    83  1360           0            0          6375       6375       2125                1        0
    84  1376           0            1          1817       1796        632                8        0
    87  1424           0            1          6020       6006       2107                7        0
    88  1440           0            1          2108       2101        744                6        0
    89  1456           0            1          2072       2064        740                5        0
    91  1488           0            1          4169       4159       1516                4        0
    92  1504           0            1          2014       2007        742                7        0
    94  1536           0            1          3904       3900       1464                3        0
    95  1552           0            1          1890       1873        720                8        0
    96  1568           0            1          1963       1958        755                5        0
    97  1584           0            1          1980       1974        770                7        0
   100  1632           0            1          6190       6187       2476                2        0
   103  1680           0            0          6477       6477       2667                7        0
   104  1696           0            1          2256       2253        940                5        0
   105  1712           0            1          2356       2340        992                8        0
   107  1744           1            0          4697       4696       2013                3        0
   110  1792           0            1          7744       7734       3388                7        0
   111  1808           0            1          2655       2649       1180                4        0
   114  1856           0            1          8371       8365       3805                5        0
   116  1888           1            0          5863       5862       2706                6        0
   117  1904           0            1          2955       2942       1379                7        0
   118  1920           0            1          3009       2997       1416                8        0
   126  2048           0            0         25276      25276      12638                1        0
   128  2080           0            1          6060       6052       3232                8        0
   129  2096           1            0          3081       3080       1659                7        0
   134  2176           0            1         14835      14830       7912                8        0
   135  2192           0            1          2769       2758       1491                7        0
   137  2224           0            1          5082       5077       2772                6        0
   140  2272           0            1          7236       7232       4020                5        0
   144  2336           0            1          8428       8423       4816                4        0
   147  2384           0            1          5316       5313       3101                7        0
   151  2448           0            1          5445       5443       3267                3        0
   155  2512           0            0          4121       4121       2536                8        0
   158  2560           0            1          2208       2205       1380                5        0
   160  2592           0            0          1133       1133        721                7        0
   168  2720           0            0          2712       2712       1808                2        0
   177  2864           1            0          1100       1098        770                7        0
   180  2912           0            1           189        183        135                5        0
   184  2976           0            1           176        166        128                8        0
   190  3072           0            0           252        252        189                3        0
   197  3184           0            1           198        192        154                7        0
   202  3264           0            1           100         96         80                4        0
   211  3408           0            1           210        208        175                5        0
   217  3504           0            1            98         94         84                6        0
   222  3584           0            0           104        104         91                7        0
   225  3632           0            1            54         50         48                8        0
   254  4096           0            0         33591      33591      33591                1        0

Note that the huge size watermark is now above 3632 and there are a number of new
normal classes available that previously were merged with the huge class.
For instance, size class #211 holds 210 objects of size 3408 and uses 175
physical pages, while previously for those objects we would have used 210
physical pages.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sergey Senozhatsky <[email protected]>
Cc: Alexey Romanov <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Nitin Gupta <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Nov 2, 2022
zsmalloc has 255 size classes.  Size classes contain a number of zspages,
which store objects of the same size.  A zspage can consist of up to four
physical pages.  The optimal zspage size is calculated for
each size class during zsmalloc pool creation.

As a reasonable optimization, zsmalloc merges size classes that have
similar characteristics: number of pages per zspage and number of objects
a zspage can store.

For example, let's look at the following size classes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
..
   94  1536           0            0             0          0          0                3        0
  100  1632           0            0             0          0          0                2        0
..

Size classes #95-#99 are merged with size class #100. That is, each time
we store an object of size, say, 1568 bytes, instead of using class #96
we end up storing it in size class #100. Class #100 is for objects of
1632 bytes in size, hence every 1568-byte object wastes 1632-1568 = 64 bytes.
Class #100 zspages consist of 2 physical pages and can hold 5 objects.
When we need to store, say, 13 objects of size 1568, we end up allocating
three zspages; in other words, 6 physical pages.

However, if we look closer at size class #96 (which should hold objects
of size 1568 bytes) and trace get_pages_per_zspage():

    pages per zspage      wasted bytes     used%
           1                  960           76
           2                  352           95
           3                 1312           89
           4                  704           95
           5                   96           99

We notice that the optimal zspage configuration for this class is
5 physical pages, but currently we never let zspages consist of more
than 4 pages. A 5-page class #96 configuration would
store 13 objects of size 1568 in a single zspage, allocating 5 physical
pages, as opposed to the 6 physical pages that class #100 allocates.
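
The trace above can be reproduced with a few lines of arithmetic. The following is a minimal Python sketch, not the kernel's C code: it assumes a 4096-byte page, integer percentages, and that the first configuration with the highest used% wins. The function names are hypothetical.

```python
PAGE_SIZE = 4096  # assumption: 4K pages, as in the tables of this message


def trace_pages_per_zspage(class_size, max_pages):
    """Return (pages, wasted_bytes, used_percent) for each zspage size."""
    rows = []
    for pages in range(1, max_pages + 1):
        zspage_size = pages * PAGE_SIZE
        waste = zspage_size % class_size
        used_pct = (zspage_size - waste) * 100 // zspage_size
        rows.append((pages, waste, used_pct))
    return rows


def best_pages(rows):
    """Pick the first row with the highest used% (strict comparison)."""
    best = rows[0]
    for row in rows[1:]:
        if row[2] > best[2]:
            best = row
    return best[0]


def pages_for_objects(n_objects, class_size, pages_per_zspage):
    """Physical pages needed to store n_objects in zspages of the given size."""
    objs_per_zspage = pages_per_zspage * PAGE_SIZE // class_size
    zspages = -(-n_objects // objs_per_zspage)  # ceiling division
    return zspages * pages_per_zspage


rows = trace_pages_per_zspage(1568, 5)
for pages, waste, used_pct in rows:
    print(f"{pages:>2} {waste:>6} {used_pct:>4}")

print("limit of 4 pages ->", best_pages(rows[:4]), "pages per zspage")
print("limit of 5 pages ->", best_pages(rows), "pages per zspage")
print("13 objects via class #100:", pages_for_objects(13, 1632, 2), "pages")
print("13 objects via 5-page class #96:", pages_for_objects(13, 1568, 5), "pages")
```

With a limit of 4 pages the sketch settles on 2 pages per zspage for 1568-byte objects, which is the same configuration as class #100 and explains why the two classes are merged.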

A higher order zspage for class #96 also changes its key characteristics:
pages per zspage and objects per zspage. As a result, classes #96 and #100
are no longer merged, which gives us a more compact zsmalloc.

Of course, the described effect is not limited to size classes #96 and #100.
We still merge classes, but less often. In other words, classes are grouped
in a more compact way, which decreases memory wastage:

zspage order               # unique size classes
     2                                69
     3                               123
     4                               191
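
The merging rule described earlier (same pages per zspage and same objects per zspage) can be modelled to see how the number of unique classes depends on the zspage limit. This is a simplified Python sketch under stated assumptions: 4096-byte pages, class sizes of 32 + 16 * index capped at one page, and a class merging only into the nearest larger unique class. All names are hypothetical, and the sketch is not guaranteed to match the kernel's counts exactly.

```python
PAGE_SIZE = 4096          # assumption: 4K pages
ZS_MIN_ALLOC_SIZE = 32    # assumption: smallest class size
ZS_SIZE_CLASS_DELTA = 16  # assumption: step between class sizes
ZS_SIZE_CLASSES = 255     # "zsmalloc has 255 size classes"


def pages_per_zspage(class_size, max_pages):
    """First zspage size (in pages) with the highest integer used%."""
    best_pages, best_used = 1, 0
    for pages in range(1, max_pages + 1):
        zspage_size = pages * PAGE_SIZE
        waste = zspage_size % class_size
        used = (zspage_size - waste) * 100 // zspage_size
        if used > best_used:
            best_pages, best_used = pages, used
    return best_pages


def merge_classes(max_pages):
    """Map every class index to the index of the class that serves it."""
    served_by = [0] * ZS_SIZE_CLASSES
    prev = None  # (index, pages, objects) of the last unique class seen
    for idx in range(ZS_SIZE_CLASSES - 1, -1, -1):
        size = min(ZS_MIN_ALLOC_SIZE + idx * ZS_SIZE_CLASS_DELTA, PAGE_SIZE)
        pages = pages_per_zspage(size, max_pages)
        objs = pages * PAGE_SIZE // size
        if prev is not None and prev[1:] == (pages, objs):
            served_by[idx] = prev[0]   # same characteristics: merge
        else:
            served_by[idx] = idx       # new unique class
            prev = (idx, pages, objs)
    return served_by


for order in (2, 3, 4):
    served_by = merge_classes(1 << order)
    print(f"order {order}: {len(set(served_by))} unique size classes")
```

For example, `merge_classes(4)[96]` is 100 (class #96 is served by class #100), while `merge_classes(8)[96]` is 96, because with up to 8 pages class #96 gets its own 5-page configuration.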

Let's take a closer look at the bottom of /sys/kernel/debug/zsmalloc/zram0/classes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  254  4096           0            0             0          0          0                1        0
...

For exactly the same reason (a maximum of 4 pages per zspage), the last non-huge
size class is #202, which stores objects of size 3264 bytes. Any object
larger than 3264 bytes is therefore considered huge and lands in size
class #254, which uses a whole physical page to store every object. To put
it slightly differently: objects in huge classes don't share physical pages.

3264 bytes is too low a watermark and we have too many huge classes:
classes #203 to #254. As with size class #96 above, higher order
zspages change key characteristics for some of those huge size classes and
thus those classes become normal classes, where stored objects share physical
pages.

Hence yet another consequence of higher order zspages: we raise the huge
size class watermark, have fewer huge classes and
store large objects in a more compact way.
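
One way to see where the watermark comes from: objects of a given size can share physical pages only if some allowed zspage of n pages fits more than n objects. The Python sketch below applies that criterion; it is a simplification for illustration, not the kernel's code, and assumes 4096-byte pages and class sizes stepping by 16 bytes from 32. The function name is hypothetical.

```python
PAGE_SIZE = 4096          # assumption: 4K pages
ZS_MIN_ALLOC_SIZE = 32    # assumption: smallest class size
ZS_SIZE_CLASS_DELTA = 16  # assumption: step between class sizes


def huge_watermark(max_pages):
    """Largest class size whose objects can still share physical pages."""
    watermark = 0
    for size in range(ZS_MIN_ALLOC_SIZE, PAGE_SIZE + 1, ZS_SIZE_CLASS_DELTA):
        shares_pages = any(
            pages * PAGE_SIZE // size > pages
            for pages in range(1, max_pages + 1)
        )
        if shares_pages:
            watermark = size
    return watermark


for order in (2, 3, 4):
    print(f"order {order}: watermark {huge_watermark(1 << order)} bytes")
# order 2: watermark 3264 bytes
# order 3: watermark 3632 bytes
# order 4: watermark 3840 bytes
```

Equivalently, with a limit of N pages the watermark is the largest class size not exceeding 4096 * N / (N + 1) bytes.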

For order 3, huge class watermark becomes 3632 bytes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  211  3408           0            0             0          0          0                5        0
  217  3504           0            0             0          0          0                6        0
  222  3584           0            0             0          0          0                7        0
  225  3632           0            0             0          0          0                8        0
  254  4096           0            0             0          0          0                1        0
...

For order 4, huge class watermark becomes 3840 bytes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  206  3328           0            0             0          0          0               13        0
  207  3344           0            0             0          0          0                9        0
  208  3360           0            0             0          0          0               14        0
  211  3408           0            0             0          0          0                5        0
  212  3424           0            0             0          0          0               16        0
  214  3456           0            0             0          0          0               11        0
  217  3504           0            0             0          0          0                6        0
  219  3536           0            0             0          0          0               13        0
  222  3584           0            0             0          0          0                7        0
  223  3600           0            0             0          0          0               15        0
  225  3632           0            0             0          0          0                8        0
  228  3680           0            0             0          0          0                9        0
  230  3712           0            0             0          0          0               10        0
  232  3744           0            0             0          0          0               11        0
  234  3776           0            0             0          0          0               12        0
  235  3792           0            0             0          0          0               13        0
  236  3808           0            0             0          0          0               14        0
  238  3840           0            0             0          0          0               15        0
  254  4096           0            0             0          0          0                1        0
...

TESTS
=====

The test untars linux-6.0.tar.xz and compiles the kernel.

zram is configured as a block device with an ext4 file system and the lzo-rle
compression algorithm. We captured /sys/block/zram0/mm_stat after
every test and rebooted the VM.

orig_data_size       mem_used_total     mem_used_max       pages_compacted
          compr_data_size         mem_limit         same_pages       huge_pages

ORDER 2 (BASE) zspage

1691791360 628086729 655171584        0 655171584       60        0    34043
1691787264 628089196 655175680        0 655175680       60        0    34046
1691803648 628098840 655187968        0 655187968       59        0    34047
1691795456 628091503 655183872        0 655183872       60        0    34044
1691799552 628086877 655183872        0 655183872       60        0    34047

ORDER 3 zspage

1691803648 627792993 641794048        0 641794048       60        0    33591
1691787264 627779342 641708032        0 641708032       59        0    33591
1691811840 627786616 641769472        0 641769472       60        0    33591
1691803648 627794468 641818624        0 641818624       59        0    33592
1691783168 627780882 641794048        0 641794048       61        0    33591

ORDER 4 zspage

1691803648 627726635 639655936        0 639655936       60        0    33435
1691811840 627733348 639643648        0 639643648       61        0    33434
1691795456 627726290 639614976        0 639614976       60        0    33435
1691803648 627730458 639688704        0 639688704       60        0    33434
1691811840 627727771 639688704        0 639688704       60        0    33434
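
To read these rows programmatically, the eight columns can be mapped onto the field names from the staggered header above. This is a small Python sketch; the `MmStat` type and `parse_mm_stat` helper are hypothetical names, and only the first eight fields are parsed because other kernels may print additional columns.

```python
from collections import namedtuple

# Column order follows the two-line header above, read left to right.
MmStat = namedtuple(
    "MmStat",
    "orig_data_size compr_data_size mem_used_total mem_limit "
    "mem_used_max same_pages pages_compacted huge_pages",
)


def parse_mm_stat(line):
    """Parse one line of /sys/block/zramN/mm_stat into named fields."""
    fields = [int(field) for field in line.split()]
    return MmStat(*fields[:8])


# First ORDER 2 sample from the results above.
sample = "1691791360 628086729 655171584        0 655171584       60        0    34043"
stat = parse_mm_stat(sample)

# Memory used by the pool beyond the compressed payload itself.
overhead = stat.mem_used_total - stat.compr_data_size
print(stat.huge_pages, overhead)
```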

Order 3 and order 4 show statistically significant improvement in
`mem_used_max` metrics.

T-test for order 3:

x order-2-maxmem
+ order-3-maxmem
    N           Min           Max        Median           Avg        Stddev
x   5 6.5517158e+08 6.5518797e+08 6.5518387e+08  6.551806e+08     6730.4157
+   5 6.4170803e+08 6.4181862e+08 6.4179405e+08 6.4177684e+08     42210.666
Difference at 95.0% confidence
	-1.34038e+07 +/- 44080.7
	-2.04581% +/- 0.00672802%
	(Student's t, pooled s = 30224.5)

T-test for order 4:

x order-2-maxmem
+ order-4-maxmem
    N           Min           Max        Median           Avg        Stddev
x   5 6.5517158e+08 6.5518797e+08 6.5518387e+08  6.551806e+08     6730.4157
+   5 6.3961498e+08  6.396887e+08 6.3965594e+08 6.3965839e+08     31408.602
Difference at 95.0% confidence
	-1.55222e+07 +/- 33126.2
	-2.36915% +/- 0.00505604%
	(Student's t, pooled s = 22713.4)
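
The figures in both comparisons can be re-derived from the `mem_used_max` columns listed earlier. The Python sketch below assumes equal sample sizes and hard-codes the two-sided 95% Student's t critical value for 8 degrees of freedom (2.306); it is an illustration of the calculation, not the tool that produced the output above.

```python
import math
import statistics

T_CRIT_95_DF8 = 2.306  # assumption: two-sided 95% critical value, 5 + 5 - 2 = 8 d.o.f.

order2 = [655171584, 655175680, 655187968, 655183872, 655183872]
order3 = [641794048, 641708032, 641769472, 641818624, 641794048]
order4 = [639655936, 639643648, 639614976, 639688704, 639688704]


def compare(base, new):
    """Return (difference, percent, pooled_stddev, half_width) for two samples."""
    mean_base = statistics.mean(base)
    diff = statistics.mean(new) - mean_base
    # Pooled standard deviation for two samples of equal size.
    pooled = math.sqrt((statistics.variance(base) + statistics.variance(new)) / 2)
    half_width = T_CRIT_95_DF8 * pooled * math.sqrt(2 / len(base))
    return diff, 100 * diff / mean_base, pooled, half_width


for name, sample in (("order 3", order3), ("order 4", order4)):
    diff, pct, pooled, half_width = compare(order2, sample)
    print(f"{name}: {diff:.0f} +/- {half_width:.0f} ({pct:.5f}%), pooled s = {pooled:.1f}")
```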

This test tends to benefit more from order 4 zspages, due to the test's data
patterns.

zsmalloc object distribution analysis
=============================================================================

Order 2 (4 pages per zspage) tends to put many objects in size class 2048,
which is merged with size classes #112-#125:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
    71  1168           0            0          6146       6146       1756                2        0
    74  1216           0            1          4560       4552       1368                3        0
    76  1248           0            1          2938       2934        904                4        0
    83  1360           0            0         10971      10971       3657                1        0
    91  1488           0            0         16126      16126       5864                4        0
    94  1536           0            1          5912       5908       2217                3        0
   100  1632           0            0         11990      11990       4796                2        0
   107  1744           0            1         15771      15768       6759                3        0
   111  1808           0            1         10386      10380       4616                4        0
   126  2048           0            0         45444      45444      22722                1        0
   144  2336           0            0         47446      47446      27112                4        0
   151  2448           1            0         10760      10759       6456                3        0
   168  2720           0            0         10173      10173       6782                2        0
   190  3072           0            1          1700       1697       1275                3        0
   202  3264           0            1           290        286        232                4        0
   254  4096           0            0         34051      34051      34051                1        0

Order 3 (8 pages per zspage) changed pool characteristics and unmerged
some of the size classes, which resulted in fewer objects being put into
size class 2048, because lower size classes are now available
for more compact object storage:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
    71  1168           0            1          2996       2994        856                2        0
    72  1184           0            1          1632       1609        476                7        0
    73  1200           1            0          1445       1442        425                5        0
    74  1216           0            0          1510       1510        453                3        0
    75  1232           0            1          1495       1479        455                7        0
    76  1248           0            1          1456       1451        448                4        0
    78  1280           0            1          3040       3033        950                5        0
    79  1296           0            1          1584       1571        504                7        0
    83  1360           0            0          6375       6375       2125                1        0
    84  1376           0            1          1817       1796        632                8        0
    87  1424           0            1          6020       6006       2107                7        0
    88  1440           0            1          2108       2101        744                6        0
    89  1456           0            1          2072       2064        740                5        0
    91  1488           0            1          4169       4159       1516                4        0
    92  1504           0            1          2014       2007        742                7        0
    94  1536           0            1          3904       3900       1464                3        0
    95  1552           0            1          1890       1873        720                8        0
    96  1568           0            1          1963       1958        755                5        0
    97  1584           0            1          1980       1974        770                7        0
   100  1632           0            1          6190       6187       2476                2        0
   103  1680           0            0          6477       6477       2667                7        0
   104  1696           0            1          2256       2253        940                5        0
   105  1712           0            1          2356       2340        992                8        0
   107  1744           1            0          4697       4696       2013                3        0
   110  1792           0            1          7744       7734       3388                7        0
   111  1808           0            1          2655       2649       1180                4        0
   114  1856           0            1          8371       8365       3805                5        0
   116  1888           1            0          5863       5862       2706                6        0
   117  1904           0            1          2955       2942       1379                7        0
   118  1920           0            1          3009       2997       1416                8        0
   126  2048           0            0         25276      25276      12638                1        0
   128  2080           0            1          6060       6052       3232                8        0
   129  2096           1            0          3081       3080       1659                7        0
   134  2176           0            1         14835      14830       7912                8        0
   135  2192           0            1          2769       2758       1491                7        0
   137  2224           0            1          5082       5077       2772                6        0
   140  2272           0            1          7236       7232       4020                5        0
   144  2336           0            1          8428       8423       4816                4        0
   147  2384           0            1          5316       5313       3101                7        0
   151  2448           0            1          5445       5443       3267                3        0
   155  2512           0            0          4121       4121       2536                8        0
   158  2560           0            1          2208       2205       1380                5        0
   160  2592           0            0          1133       1133        721                7        0
   168  2720           0            0          2712       2712       1808                2        0
   177  2864           1            0          1100       1098        770                7        0
   180  2912           0            1           189        183        135                5        0
   184  2976           0            1           176        166        128                8        0
   190  3072           0            0           252        252        189                3        0
   197  3184           0            1           198        192        154                7        0
   202  3264           0            1           100         96         80                4        0
   211  3408           0            1           210        208        175                5        0
   217  3504           0            1            98         94         84                6        0
   222  3584           0            0           104        104         91                7        0
   225  3632           0            1            54         50         48                8        0
   254  4096           0            0         33591      33591      33591                1        0

Note that the huge size watermark is now above 3632 and there are a number of new
normal classes available that previously were merged with the huge class.
For instance, size class #211 holds 210 objects of size 3408 and uses 175
physical pages, while previously for those objects we would have used 210
physical pages.
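
The `pages_used` column follows directly from `obj_allocated`, the class size and `pages_per_zspage`. A short Python sketch of that arithmetic, assuming 4096-byte pages (the function name is hypothetical):

```python
PAGE_SIZE = 4096  # assumption: 4K pages


def pages_used(obj_allocated, class_size, pages_per_zspage):
    """Physical pages backing obj_allocated slots of a given size class."""
    objs_per_zspage = pages_per_zspage * PAGE_SIZE // class_size
    zspages = -(-obj_allocated // objs_per_zspage)  # ceiling division
    return zspages * pages_per_zspage


# Class #211 with order 3: 5-page zspages holding 6 objects each.
print(pages_used(210, 3408, 5))   # 175
# The same objects in the huge class: one page per object.
print(pages_used(210, 4096, 1))   # 210
```

The same formula matches other rows of the order 3 table, for example class 100 (6190 objects of 1632 bytes, 2 pages per zspage) gives 2476 pages.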

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sergey Senozhatsky <[email protected]>
Cc: Alexey Romanov <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Nitin Gupta <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Nov 3, 2022
zsmalloc has 255 size classes.  Size classes contain a number of zspages,
which store objects of the same size.  A zspage can consist of up to four
physical pages.  The optimal zspage size is calculated for
each size class during zsmalloc pool creation.

As a reasonable optimization, zsmalloc merges size classes that have
similar characteristics: number of pages per zspage and number of objects
a zspage can store.

For example, let's look at the following size classes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
..
   94  1536           0            0             0          0          0                3        0
  100  1632           0            0             0          0          0                2        0
..

Size classes #95-#99 are merged with size class #100. That is, each time
we store an object of size, say, 1568 bytes, instead of using class #96
we end up storing it in size class #100. Class #100 is for objects of
1632 bytes in size, hence every 1568-byte object wastes 1632-1568 = 64 bytes.
Class #100 zspages consist of 2 physical pages and can hold 5 objects.
When we need to store, say, 13 objects of size 1568, we end up allocating
three zspages; in other words, 6 physical pages.

However, if we look closer at size class #96 (which should hold objects
of size 1568 bytes) and trace get_pages_per_zspage():

    pages per zspage      wasted bytes     used%
           1                  960           76
           2                  352           95
           3                 1312           89
           4                  704           95
           5                   96           99

We notice that the optimal zspage configuration for this class is
5 physical pages, but currently we never let zspages consist of more
than 4 pages. A 5-page class #96 configuration would
store 13 objects of size 1568 in a single zspage, allocating 5 physical
pages, as opposed to the 6 physical pages that class #100 allocates.

A higher order zspage for class #96 also changes its key characteristics:
pages per zspage and objects per zspage. As a result, classes #96 and #100
are no longer merged, which gives us a more compact zsmalloc.

Of course, the described effect is not limited to size classes #96 and #100.
We still merge classes, but less often. In other words, classes are grouped
in a more compact way, which decreases memory wastage:

zspage order               # unique size classes
     2                                69
     3                               123
     4                               191

Let's take a closer look at the bottom of /sys/kernel/debug/zsmalloc/zram0/classes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  254  4096           0            0             0          0          0                1        0
...

For exactly the same reason (a maximum of 4 pages per zspage), the last non-huge
size class is #202, which stores objects of size 3264 bytes. Any object
larger than 3264 bytes is therefore considered huge and lands in size
class #254, which uses a whole physical page to store every object. To put
it slightly differently: objects in huge classes don't share physical pages.

3264 bytes is too low a watermark and we have too many huge classes:
classes #203 to #254. As with size class #96 above, higher order
zspages change key characteristics for some of those huge size classes and
thus those classes become normal classes, where stored objects share physical
pages.

Hence yet another consequence of higher order zspages: we raise the huge
size class watermark, have fewer huge classes and
store large objects in a more compact way.

For order 3, huge class watermark becomes 3632 bytes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  211  3408           0            0             0          0          0                5        0
  217  3504           0            0             0          0          0                6        0
  222  3584           0            0             0          0          0                7        0
  225  3632           0            0             0          0          0                8        0
  254  4096           0            0             0          0          0                1        0
...

For order 4, huge class watermark becomes 3840 bytes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  206  3328           0            0             0          0          0               13        0
  207  3344           0            0             0          0          0                9        0
  208  3360           0            0             0          0          0               14        0
  211  3408           0            0             0          0          0                5        0
  212  3424           0            0             0          0          0               16        0
  214  3456           0            0             0          0          0               11        0
  217  3504           0            0             0          0          0                6        0
  219  3536           0            0             0          0          0               13        0
  222  3584           0            0             0          0          0                7        0
  223  3600           0            0             0          0          0               15        0
  225  3632           0            0             0          0          0                8        0
  228  3680           0            0             0          0          0                9        0
  230  3712           0            0             0          0          0               10        0
  232  3744           0            0             0          0          0               11        0
  234  3776           0            0             0          0          0               12        0
  235  3792           0            0             0          0          0               13        0
  236  3808           0            0             0          0          0               14        0
  238  3840           0            0             0          0          0               15        0
  254  4096           0            0             0          0          0                1        0
...

TESTS
=====

The test untars linux-6.0.tar.xz and compiles the kernel.

zram is configured as a block device with an ext4 file system and the lzo-rle
compression algorithm. We captured /sys/block/zram0/mm_stat after
every test and rebooted the VM.

orig_data_size       mem_used_total     mem_used_max       pages_compacted
          compr_data_size         mem_limit         same_pages       huge_pages

ORDER 2 (BASE) zspage

1691791360 628086729 655171584        0 655171584       60        0    34043
1691787264 628089196 655175680        0 655175680       60        0    34046
1691803648 628098840 655187968        0 655187968       59        0    34047
1691795456 628091503 655183872        0 655183872       60        0    34044
1691799552 628086877 655183872        0 655183872       60        0    34047

ORDER 3 zspage

1691803648 627792993 641794048        0 641794048       60        0    33591
1691787264 627779342 641708032        0 641708032       59        0    33591
1691811840 627786616 641769472        0 641769472       60        0    33591
1691803648 627794468 641818624        0 641818624       59        0    33592
1691783168 627780882 641794048        0 641794048       61        0    33591

ORDER 4 zspage

1691803648 627726635 639655936        0 639655936       60        0    33435
1691811840 627733348 639643648        0 639643648       61        0    33434
1691795456 627726290 639614976        0 639614976       60        0    33435
1691803648 627730458 639688704        0 639688704       60        0    33434
1691811840 627727771 639688704        0 639688704       60        0    33434

Order 3 and order 4 show statistically significant improvement in
`mem_used_max` metrics.

T-test for order 3:

x order-2-maxmem
+ order-3-maxmem
    N           Min           Max        Median           Avg        Stddev
x   5 6.5517158e+08 6.5518797e+08 6.5518387e+08  6.551806e+08     6730.4157
+   5 6.4170803e+08 6.4181862e+08 6.4179405e+08 6.4177684e+08     42210.666
Difference at 95.0% confidence
	-1.34038e+07 +/- 44080.7
	-2.04581% +/- 0.00672802%
	(Student's t, pooled s = 30224.5)

T-test for order 4:

x order-2-maxmem
+ order-4-maxmem
    N           Min           Max        Median           Avg        Stddev
x   5 6.5517158e+08 6.5518797e+08 6.5518387e+08  6.551806e+08     6730.4157
+   5 6.3961498e+08  6.396887e+08 6.3965594e+08 6.3965839e+08     31408.602
Difference at 95.0% confidence
	-1.55222e+07 +/- 33126.2
	-2.36915% +/- 0.00505604%
	(Student's t, pooled s = 22713.4)

This test tends to benefit more from order 4 zspages, due to the test's data
patterns.

zsmalloc object distribution analysis
=============================================================================

Order 2 (4 pages per zspage) tends to put many objects in size class 2048,
which is merged with size classes #112-#125:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
    71  1168           0            0          6146       6146       1756                2        0
    74  1216           0            1          4560       4552       1368                3        0
    76  1248           0            1          2938       2934        904                4        0
    83  1360           0            0         10971      10971       3657                1        0
    91  1488           0            0         16126      16126       5864                4        0
    94  1536           0            1          5912       5908       2217                3        0
   100  1632           0            0         11990      11990       4796                2        0
   107  1744           0            1         15771      15768       6759                3        0
   111  1808           0            1         10386      10380       4616                4        0
   126  2048           0            0         45444      45444      22722                1        0
   144  2336           0            0         47446      47446      27112                4        0
   151  2448           1            0         10760      10759       6456                3        0
   168  2720           0            0         10173      10173       6782                2        0
   190  3072           0            1          1700       1697       1275                3        0
   202  3264           0            1           290        286        232                4        0
   254  4096           0            0         34051      34051      34051                1        0

Order 3 (8 pages per zspage) changed pool characteristics and unmerged
some of the size classes, which resulted in fewer objects being put into
size class 2048, because lower size classes are now available for more
compact object storage:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
    71  1168           0            1          2996       2994        856                2        0
    72  1184           0            1          1632       1609        476                7        0
    73  1200           1            0          1445       1442        425                5        0
    74  1216           0            0          1510       1510        453                3        0
    75  1232           0            1          1495       1479        455                7        0
    76  1248           0            1          1456       1451        448                4        0
    78  1280           0            1          3040       3033        950                5        0
    79  1296           0            1          1584       1571        504                7        0
    83  1360           0            0          6375       6375       2125                1        0
    84  1376           0            1          1817       1796        632                8        0
    87  1424           0            1          6020       6006       2107                7        0
    88  1440           0            1          2108       2101        744                6        0
    89  1456           0            1          2072       2064        740                5        0
    91  1488           0            1          4169       4159       1516                4        0
    92  1504           0            1          2014       2007        742                7        0
    94  1536           0            1          3904       3900       1464                3        0
    95  1552           0            1          1890       1873        720                8        0
    96  1568           0            1          1963       1958        755                5        0
    97  1584           0            1          1980       1974        770                7        0
   100  1632           0            1          6190       6187       2476                2        0
   103  1680           0            0          6477       6477       2667                7        0
   104  1696           0            1          2256       2253        940                5        0
   105  1712           0            1          2356       2340        992                8        0
   107  1744           1            0          4697       4696       2013                3        0
   110  1792           0            1          7744       7734       3388                7        0
   111  1808           0            1          2655       2649       1180                4        0
   114  1856           0            1          8371       8365       3805                5        0
   116  1888           1            0          5863       5862       2706                6        0
   117  1904           0            1          2955       2942       1379                7        0
   118  1920           0            1          3009       2997       1416                8        0
   126  2048           0            0         25276      25276      12638                1        0
   128  2080           0            1          6060       6052       3232                8        0
   129  2096           1            0          3081       3080       1659                7        0
   134  2176           0            1         14835      14830       7912                8        0
   135  2192           0            1          2769       2758       1491                7        0
   137  2224           0            1          5082       5077       2772                6        0
   140  2272           0            1          7236       7232       4020                5        0
   144  2336           0            1          8428       8423       4816                4        0
   147  2384           0            1          5316       5313       3101                7        0
   151  2448           0            1          5445       5443       3267                3        0
   155  2512           0            0          4121       4121       2536                8        0
   158  2560           0            1          2208       2205       1380                5        0
   160  2592           0            0          1133       1133        721                7        0
   168  2720           0            0          2712       2712       1808                2        0
   177  2864           1            0          1100       1098        770                7        0
   180  2912           0            1           189        183        135                5        0
   184  2976           0            1           176        166        128                8        0
   190  3072           0            0           252        252        189                3        0
   197  3184           0            1           198        192        154                7        0
   202  3264           0            1           100         96         80                4        0
   211  3408           0            1           210        208        175                5        0
   217  3504           0            1            98         94         84                6        0
   222  3584           0            0           104        104         91                7        0
   225  3632           0            1            54         50         48                8        0
   254  4096           0            0         33591      33591      33591                1        0

Note that the huge size watermark is now above 3632 bytes, and a number of
new normal classes are available that were previously merged with the huge
class. For instance, size class #211 holds 210 objects of size 3408 and
uses 175 physical pages, while previously we would have used 210 physical
pages for those objects.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sergey Senozhatsky <[email protected]>
Cc: Alexey Romanov <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Nitin Gupta <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
akiernan pushed a commit to zuma-array/linux that referenced this pull request Nov 3, 2022
driver defect clean up:
#40
#41
#99
#100
#395
#396
#475
#614
#669

Change-Id: I581aaa8a1b950278bbf74d0c94aa647de89e07a9
Signed-off-by: Evoke Zhang <[email protected]>
akiernan pushed a commit to zuma-array/linux that referenced this pull request Nov 4, 2022
driver defect clean up:
#40
#41
#99
#100
#395
#396
#475
#614
#669

Change-Id: I581aaa8a1b950278bbf74d0c94aa647de89e07a9
Signed-off-by: Evoke Zhang <[email protected]>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Nov 5, 2022
zsmalloc has 255 size classes.  Size classes contain a number of zspages,
which store objects of the same size.  A zspage can consist of up to four
physical pages.  The exact (optimal) zspage size is calculated for each
size class during zsmalloc pool creation.

As a reasonable optimization, zsmalloc merges size classes that have
similar characteristics: the number of pages per zspage and the number of
objects a zspage can store.

For example, let's look at the following size classes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
..
   94  1536           0            0             0          0          0                3        0
  100  1632           0            0             0          0          0                2        0
..

Size classes #95-#99 are merged with size class #100. That is, each time
we store an object of, say, 1568 bytes, instead of using class #96 we end
up storing it in size class #100. Class #100 is for objects of 1632 bytes,
hence every 1568-byte object wastes 1632-1568 = 64 bytes. Class #100
zspages consist of 2 physical pages and can hold 5 objects. When we need
to store, say, 13 objects of size 1568, we end up allocating three
zspages; in other words, 6 physical pages.
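
The arithmetic in the paragraph above can be checked with a few lines of
Python. This is only an illustrative sketch: the 4096-byte page size is an
assumption (it matches the size of class 254 in the tables), and
pages_needed() is a hypothetical helper, not a zsmalloc function.

```python
import math

PAGE_SIZE = 4096        # assumed 4 KiB pages
OBJ_SIZE = 1568         # size of the objects we want to store
CLASS_100_SIZE = 1632   # slot size of the class the objects land in
CLASS_100_PAGES = 2     # physical pages per zspage for that class

# How many 1632-byte slots fit into one 2-page zspage.
objs_per_zspage = CLASS_100_PAGES * PAGE_SIZE // CLASS_100_SIZE

# Bytes lost per object because the slot is larger than the object.
waste_per_object = CLASS_100_SIZE - OBJ_SIZE


def pages_needed(nr_objects):
    """Physical pages allocated to store nr_objects in the merged class."""
    zspages = math.ceil(nr_objects / objs_per_zspage)
    return zspages * CLASS_100_PAGES


print(objs_per_zspage, waste_per_object, pages_needed(13))
```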

However, if we look closer at size class #96 (which should hold objects
of size 1568 bytes) and trace get_pages_per_zspage():

    pages per zspage      wasted bytes     used%
           1                  960           76
           2                  352           95
           3                 1312           89
           4                  704           95
           5                   96           99

We'd notice that the optimal zspage configuration for this class is 5
physical pages, but currently we never let a zspage consist of more than
4 pages. A 5-page class #96 configuration would store 13 objects of size
1568 in a single zspage, allocating 5 physical pages, as opposed to the 6
physical pages that class #100 allocates.
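
The trace table above can be reproduced with a short sketch. It assumes
4096-byte pages and integer percentages; trace_pages_per_zspage() and
best_pages() are hypothetical names that only model the selection described
here (keep the smallest page count that gives the highest used%), and are
not the kernel's get_pages_per_zspage() itself.

```python
PAGE_SIZE = 4096  # assumed 4 KiB pages


def trace_pages_per_zspage(class_size, max_pages):
    """Return (pages, wasted bytes, used%) for every candidate zspage size."""
    rows = []
    for pages in range(1, max_pages + 1):
        zspage_size = pages * PAGE_SIZE
        waste = zspage_size % class_size
        used_pct = (zspage_size - waste) * 100 // zspage_size
        rows.append((pages, waste, used_pct))
    return rows


def best_pages(class_size, max_pages):
    """Smallest page count that reaches the highest used%."""
    best, best_pct = 1, 0
    for pages, _waste, used_pct in trace_pages_per_zspage(class_size, max_pages):
        if used_pct > best_pct:
            best, best_pct = pages, used_pct
    return best


for row in trace_pages_per_zspage(1568, 5):
    print("%2d %6d %4d" % row)
```

With a 4-page limit this model picks 2 pages for 1568-byte objects (the
same shape as class #100); with a limit of 5 or more it picks 5 pages.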

A higher order zspage for class #96 also changes its key characteristics:
pages per zspage and objects per zspage. As a result, classes #96 and #100
are no longer merged, which gives us a more compact zsmalloc.

Of course, the described effect is not limited to size classes #96 and
#100. We still merge classes, but less often. In other words, classes are
grouped in a more compact way, which decreases memory wastage:

zspage order               # unique size classes
     2                                69
     3                               123
     4                               191
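
The merging rule can be modelled in a few lines. The sketch below is a
simplified model, not the kernel code: the constants are assumptions
inferred from the tables (class 94 is 1536 bytes and class 100 is 1632
bytes, so sizes are taken as 32 + 16 * index, capped at one page), and
zspage_shape() and unique_classes() are hypothetical helpers. A class is
treated as merged into the next larger class when both end up with the same
pages per zspage and objects per zspage.

```python
PAGE_SIZE = 4096           # assumed 4 KiB pages
ZS_MIN_ALLOC_SIZE = 32     # assumed: consistent with class 94 <-> 1536 bytes
ZS_SIZE_CLASS_DELTA = 16   # assumed step between adjacent class sizes
NR_CLASSES = 255


def class_size(idx):
    return min(ZS_MIN_ALLOC_SIZE + idx * ZS_SIZE_CLASS_DELTA, PAGE_SIZE)


def zspage_shape(size, max_pages):
    """Return (pages per zspage, objects per zspage) for one class size."""
    best_pages, best_pct = 1, 0
    for pages in range(1, max_pages + 1):
        zspage_size = pages * PAGE_SIZE
        used_pct = (zspage_size - zspage_size % size) * 100 // zspage_size
        if used_pct > best_pct:
            best_pages, best_pct = pages, used_pct
    return best_pages, best_pages * PAGE_SIZE // size


def unique_classes(max_pages):
    """Indices of classes that are not merged into a larger class."""
    heads, prev_shape = [], None
    for idx in range(NR_CLASSES - 1, -1, -1):  # walk from largest to smallest
        shape = zspage_shape(class_size(idx), max_pages)
        if shape != prev_shape:
            heads.append(idx)
            prev_shape = shape
    return heads


for order in (2, 3, 4):
    print(order, len(unique_classes(1 << order)))
```

In this model class 96 is absent from unique_classes(4), because it shares
the (2 pages, 5 objects) shape with class 100, and present in
unique_classes(8), where its shape becomes (5 pages, 13 objects). The
printed counts can be compared with the table above; the model is not
guaranteed to match the kernel exactly.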

Let's take a closer look at the bottom of /sys/kernel/debug/zsmalloc/zram0/classes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  254  4096           0            0             0          0          0                1        0
...

For exactly the same reason (a maximum of 4 pages per zspage) the last
non-huge size class is #202, which stores objects of size 3264 bytes. Any
object larger than 3264 bytes is hence considered huge and lands in size
class #254, which uses a whole physical page to store every object. To put
it slightly differently: objects in huge classes don't share physical pages.

3264 bytes is too low a watermark and we have too many huge classes:
classes #203 to #254. As with size class #96 above, higher order zspages
change key characteristics for some of those huge size classes, and thus
those classes become normal classes, where stored objects share physical
pages.

Hence yet another consequence of higher order zspages: we raise the huge
size class watermark, have fewer huge classes and store large objects in
a more compact way.

For order 3, the huge class watermark becomes 3632 bytes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  211  3408           0            0             0          0          0                5        0
  217  3504           0            0             0          0          0                6        0
  222  3584           0            0             0          0          0                7        0
  225  3632           0            0             0          0          0                8        0
  254  4096           0            0             0          0          0                1        0
...

For order 4, the huge class watermark becomes 3840 bytes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  206  3328           0            0             0          0          0               13        0
  207  3344           0            0             0          0          0                9        0
  208  3360           0            0             0          0          0               14        0
  211  3408           0            0             0          0          0                5        0
  212  3424           0            0             0          0          0               16        0
  214  3456           0            0             0          0          0               11        0
  217  3504           0            0             0          0          0                6        0
  219  3536           0            0             0          0          0               13        0
  222  3584           0            0             0          0          0                7        0
  223  3600           0            0             0          0          0               15        0
  225  3632           0            0             0          0          0                8        0
  228  3680           0            0             0          0          0                9        0
  230  3712           0            0             0          0          0               10        0
  232  3744           0            0             0          0          0               11        0
  234  3776           0            0             0          0          0               12        0
  235  3792           0            0             0          0          0               13        0
  236  3808           0            0             0          0          0               14        0
  238  3840           0            0             0          0          0               15        0
  254  4096           0            0             0          0          0                1        0
...
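
The watermarks quoted above follow from a simple observation: objects
start sharing pages once p + 1 of them fit into p pages, which requires
size <= p * PAGE_SIZE / (p + 1). The sketch below evaluates that bound for
the largest allowed p and rounds it down to the class size grid. The grid
(sizes of the form 32 + 16 * k) and the 4096-byte page size are assumptions
inferred from the tables, and huge_watermark() is a hypothetical helper,
not a zsmalloc function.

```python
PAGE_SIZE = 4096   # assumed 4 KiB pages
MIN_SIZE = 32      # assumed smallest class size
DELTA = 16         # assumed step between adjacent class sizes


def huge_watermark(max_pages):
    """Largest class size whose objects can still share physical pages."""
    # max_pages + 1 objects must fit into max_pages pages.
    limit = max_pages * PAGE_SIZE // (max_pages + 1)
    # Round down to a size of the form MIN_SIZE + k * DELTA.
    return MIN_SIZE + (limit - MIN_SIZE) // DELTA * DELTA


for order in (2, 3, 4):
    print(order, huge_watermark(1 << order))
```

For orders 2, 3 and 4 this gives 3264, 3632 and 3840 bytes, the sizes of
the last non-huge classes shown in the listings.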

TESTS
=====

The test untars linux-6.0.tar.xz and compiles the kernel.

zram is configured as a block device with an ext4 file system and the
lzo-rle compression algorithm. We captured /sys/block/zram0/mm_stat after
every test and rebooted the VM.

orig_data_size       mem_used_total     mem_used_max       pages_compacted
          compr_data_size         mem_limit         same_pages       huge_pages

ORDER 2 (BASE) zspage

1691791360 628086729 655171584        0 655171584       60        0    34043
1691787264 628089196 655175680        0 655175680       60        0    34046
1691803648 628098840 655187968        0 655187968       59        0    34047
1691795456 628091503 655183872        0 655183872       60        0    34044
1691799552 628086877 655183872        0 655183872       60        0    34047

ORDER 3 zspage

1691803648 627792993 641794048        0 641794048       60        0    33591
1691787264 627779342 641708032        0 641708032       59        0    33591
1691811840 627786616 641769472        0 641769472       60        0    33591
1691803648 627794468 641818624        0 641818624       59        0    33592
1691783168 627780882 641794048        0 641794048       61        0    33591

ORDER 4 zspage

1691803648 627726635 639655936        0 639655936       60        0    33435
1691811840 627733348 639643648        0 639643648       61        0    33434
1691795456 627726290 639614976        0 639614976       60        0    33435
1691803648 627730458 639688704        0 639688704       60        0    33434
1691811840 627727771 639688704        0 639688704       60        0    33434

Order 3 and order 4 show a statistically significant improvement in the
`mem_used_max` metric.

T-test for order 3:

x order-2-maxmem
+ order-3-maxmem
    N           Min           Max        Median           Avg        Stddev
x   5 6.5517158e+08 6.5518797e+08 6.5518387e+08  6.551806e+08     6730.4157
+   5 6.4170803e+08 6.4181862e+08 6.4179405e+08 6.4177684e+08     42210.666
Difference at 95.0% confidence
	-1.34038e+07 +/- 44080.7
	-2.04581% +/- 0.00672802%
	(Student's t, pooled s = 30224.5)
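
The figures in the order 3 comparison can be recomputed from the
mem_used_max column (third column) of the mm_stat samples listed earlier.
This is a sketch using only the Python standard library: compare() is a
hypothetical helper, and the critical value 2.306 is the usual table value
for a two-sided 95% interval with 5 + 5 - 2 = 8 degrees of freedom.

```python
import math
import statistics

# mem_used_max samples copied from the mm_stat listings.
order2 = [655171584, 655175680, 655187968, 655183872, 655183872]
order3 = [641794048, 641708032, 641769472, 641818624, 641794048]

# Two-sided 95% Student's t critical value for 8 degrees of freedom.
T_CRIT_DF8 = 2.306


def compare(base, new, t_crit):
    """Return (difference of means, CI half-width, pooled stddev)."""
    n1, n2 = len(base), len(new)
    diff = statistics.mean(new) - statistics.mean(base)
    pooled_var = ((n1 - 1) * statistics.variance(base) +
                  (n2 - 1) * statistics.variance(new)) / (n1 + n2 - 2)
    pooled_s = math.sqrt(pooled_var)
    half_width = t_crit * pooled_s * math.sqrt(1 / n1 + 1 / n2)
    return diff, half_width, pooled_s


diff, half_width, pooled_s = compare(order2, order3, T_CRIT_DF8)
print(f"{diff:.6g} +/- {half_width:.6g} (pooled s = {pooled_s:.6g})")
print(f"{100 * diff / statistics.mean(order2):.5f}%")
```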

T-test for order 4:

x order-2-maxmem
+ order-4-maxmem
    N           Min           Max        Median           Avg        Stddev
x   5 6.5517158e+08 6.5518797e+08 6.5518387e+08  6.551806e+08     6730.4157
+   5 6.3961498e+08  6.396887e+08 6.3965594e+08 6.3965839e+08     31408.602
Difference at 95.0% confidence
	-1.55222e+07 +/- 33126.2
	-2.36915% +/- 0.00505604%
	(Student's t, pooled s = 22713.4)

This test tends to benefit more from order 4 zspages, due to the test's
data patterns.

zsmalloc object distribution analysis
=============================================================================

Order 2 (4 pages per zspage) tends to put many objects in size class 2048,
which is merged with size classes #112-#125:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
    71  1168           0            0          6146       6146       1756                2        0
    74  1216           0            1          4560       4552       1368                3        0
    76  1248           0            1          2938       2934        904                4        0
    83  1360           0            0         10971      10971       3657                1        0
    91  1488           0            0         16126      16126       5864                4        0
    94  1536           0            1          5912       5908       2217                3        0
   100  1632           0            0         11990      11990       4796                2        0
   107  1744           0            1         15771      15768       6759                3        0
   111  1808           0            1         10386      10380       4616                4        0
   126  2048           0            0         45444      45444      22722                1        0
   144  2336           0            0         47446      47446      27112                4        0
   151  2448           1            0         10760      10759       6456                3        0
   168  2720           0            0         10173      10173       6782                2        0
   190  3072           0            1          1700       1697       1275                3        0
   202  3264           0            1           290        286        232                4        0
   254  4096           0            0         34051      34051      34051                1        0

Order 3 (8 pages per zspage) changed pool characteristics and unmerged
some of the size classes, which resulted in fewer objects being put into
size class 2048, because lower size classes are now available for more
compact object storage:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
    71  1168           0            1          2996       2994        856                2        0
    72  1184           0            1          1632       1609        476                7        0
    73  1200           1            0          1445       1442        425                5        0
    74  1216           0            0          1510       1510        453                3        0
    75  1232           0            1          1495       1479        455                7        0
    76  1248           0            1          1456       1451        448                4        0
    78  1280           0            1          3040       3033        950                5        0
    79  1296           0            1          1584       1571        504                7        0
    83  1360           0            0          6375       6375       2125                1        0
    84  1376           0            1          1817       1796        632                8        0
    87  1424           0            1          6020       6006       2107                7        0
    88  1440           0            1          2108       2101        744                6        0
    89  1456           0            1          2072       2064        740                5        0
    91  1488           0            1          4169       4159       1516                4        0
    92  1504           0            1          2014       2007        742                7        0
    94  1536           0            1          3904       3900       1464                3        0
    95  1552           0            1          1890       1873        720                8        0
    96  1568           0            1          1963       1958        755                5        0
    97  1584           0            1          1980       1974        770                7        0
   100  1632           0            1          6190       6187       2476                2        0
   103  1680           0            0          6477       6477       2667                7        0
   104  1696           0            1          2256       2253        940                5        0
   105  1712           0            1          2356       2340        992                8        0
   107  1744           1            0          4697       4696       2013                3        0
   110  1792           0            1          7744       7734       3388                7        0
   111  1808           0            1          2655       2649       1180                4        0
   114  1856           0            1          8371       8365       3805                5        0
   116  1888           1            0          5863       5862       2706                6        0
   117  1904           0            1          2955       2942       1379                7        0
   118  1920           0            1          3009       2997       1416                8        0
   126  2048           0            0         25276      25276      12638                1        0
   128  2080           0            1          6060       6052       3232                8        0
   129  2096           1            0          3081       3080       1659                7        0
   134  2176           0            1         14835      14830       7912                8        0
   135  2192           0            1          2769       2758       1491                7        0
   137  2224           0            1          5082       5077       2772                6        0
   140  2272           0            1          7236       7232       4020                5        0
   144  2336           0            1          8428       8423       4816                4        0
   147  2384           0            1          5316       5313       3101                7        0
   151  2448           0            1          5445       5443       3267                3        0
   155  2512           0            0          4121       4121       2536                8        0
   158  2560           0            1          2208       2205       1380                5        0
   160  2592           0            0          1133       1133        721                7        0
   168  2720           0            0          2712       2712       1808                2        0
   177  2864           1            0          1100       1098        770                7        0
   180  2912           0            1           189        183        135                5        0
   184  2976           0            1           176        166        128                8        0
   190  3072           0            0           252        252        189                3        0
   197  3184           0            1           198        192        154                7        0
   202  3264           0            1           100         96         80                4        0
   211  3408           0            1           210        208        175                5        0
   217  3504           0            1            98         94         84                6        0
   222  3584           0            0           104        104         91                7        0
   225  3632           0            1            54         50         48                8        0
   254  4096           0            0         33591      33591      33591                1        0

Note that the huge size watermark is now above 3632 bytes, and a number of
new normal classes are available that were previously merged with the huge
class. For instance, size class #211 holds 210 objects of size 3408 and
uses 175 physical pages, while previously we would have used 210 physical
pages for those objects.
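
The class #211 numbers can be verified with a small calculation. It
assumes 4096-byte pages; pages_used() is a hypothetical helper that counts
whole zspages, which is how the pages_used column behaves for the rows
shown above.

```python
import math

PAGE_SIZE = 4096  # assumed 4 KiB pages


def pages_used(nr_objects, obj_size, pages_per_zspage):
    """Physical pages needed when objects are packed into whole zspages."""
    objs_per_zspage = pages_per_zspage * PAGE_SIZE // obj_size
    zspages = math.ceil(nr_objects / objs_per_zspage)
    return zspages * pages_per_zspage


# Class #211 with order 3: 3408-byte slots in 5-page zspages.
print(pages_used(210, 3408, 5))
# The same objects in the huge class: one 4096-byte slot per page.
print(pages_used(210, 4096, 1))
```

Six 3408-byte objects fit into a 5-page zspage, so 210 allocated objects
need 35 zspages, or 175 pages, instead of 210 pages in the huge class.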

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sergey Senozhatsky <[email protected]>
Cc: Alexey Romanov <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Nitin Gupta <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
jonhunter pushed a commit to jonhunter/linux that referenced this pull request Nov 7, 2022
zsmalloc has 255 size classes.  Size classes contain a number of zspages,
which store objects of the same size.  A zspage can consist of up to four
physical pages.  The exact (optimal) zspage size is calculated for each
size class during zsmalloc pool creation.

As a reasonable optimization, zsmalloc merges size classes that have
similar characteristics: the number of pages per zspage and the number of
objects a zspage can store.

For example, let's look at the following size classes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
..
   94  1536           0            0             0          0          0                3        0
  100  1632           0            0             0          0          0                2        0
..

Size classes #95-#99 are merged with size class #100. That is, each time
we store an object of, say, 1568 bytes, instead of using class #96 we end
up storing it in size class #100. Class #100 is for objects of 1632 bytes,
hence every 1568-byte object wastes 1632-1568 = 64 bytes. Class #100
zspages consist of 2 physical pages and can hold 5 objects. When we need
to store, say, 13 objects of size 1568, we end up allocating three
zspages; in other words, 6 physical pages.

However, if we look closer at size class #96 (which should hold objects
of size 1568 bytes) and trace get_pages_per_zspage():

    pages per zspage      wasted bytes     used%
           1                  960           76
           2                  352           95
           3                 1312           89
           4                  704           95
           5                   96           99

We'd notice that the optimal zspage configuration for this class is 5
physical pages, but currently we never let a zspage consist of more than
4 pages. A 5-page class #96 configuration would store 13 objects of size
1568 in a single zspage, allocating 5 physical pages, as opposed to the 6
physical pages that class #100 allocates.

A higher order zspage for class #96 also changes its key characteristics:
pages per zspage and objects per zspage. As a result, classes #96 and #100
are no longer merged, which gives us a more compact zsmalloc.

Of course, the described effect is not limited to size classes #96 and
#100. We still merge classes, but less often. In other words, classes are
grouped in a more compact way, which decreases memory wastage:

zspage order               # unique size classes
     2                                69
     3                               123
     4                               191

Let's take a closer look at the bottom of /sys/kernel/debug/zsmalloc/zram0/classes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  254  4096           0            0             0          0          0                1        0
...

For exactly the same reason (a maximum of 4 pages per zspage) the last
non-huge size class is #202, which stores objects of size 3264 bytes. Any
object larger than 3264 bytes is hence considered huge and lands in size
class #254, which uses a whole physical page to store every object. To put
it slightly differently: objects in huge classes don't share physical pages.

3264 bytes is too low a watermark and we have too many huge classes:
classes #203 to #254. As with size class #96 above, higher order zspages
change key characteristics for some of those huge size classes, and thus
those classes become normal classes, where stored objects share physical
pages.

Hence yet another consequence of higher order zspages: we raise the huge
size class watermark, have fewer huge classes and store large objects in
a more compact way.

For order 3, the huge class watermark becomes 3632 bytes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  211  3408           0            0             0          0          0                5        0
  217  3504           0            0             0          0          0                6        0
  222  3584           0            0             0          0          0                7        0
  225  3632           0            0             0          0          0                8        0
  254  4096           0            0             0          0          0                1        0
...

For order 4, the huge class watermark becomes 3840 bytes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  206  3328           0            0             0          0          0               13        0
  207  3344           0            0             0          0          0                9        0
  208  3360           0            0             0          0          0               14        0
  211  3408           0            0             0          0          0                5        0
  212  3424           0            0             0          0          0               16        0
  214  3456           0            0             0          0          0               11        0
  217  3504           0            0             0          0          0                6        0
  219  3536           0            0             0          0          0               13        0
  222  3584           0            0             0          0          0                7        0
  223  3600           0            0             0          0          0               15        0
  225  3632           0            0             0          0          0                8        0
  228  3680           0            0             0          0          0                9        0
  230  3712           0            0             0          0          0               10        0
  232  3744           0            0             0          0          0               11        0
  234  3776           0            0             0          0          0               12        0
  235  3792           0            0             0          0          0               13        0
  236  3808           0            0             0          0          0               14        0
  238  3840           0            0             0          0          0               15        0
  254  4096           0            0             0          0          0                1        0
...

TESTS
=====

Test untars linux-6.0.tar.xz and compiles the kernel.

zram is configured as a block device with ext4 file system, lzo-rle
compression algorithm. We captured /sys/block/zram0/mm_stat after
every test and rebooted the VM.

orig_data_size       mem_used_total     mem_used_max       pages_compacted
          compr_data_size         mem_limit         same_pages       huge_pages

ORDER 2 (BASE) zspage

1691791360 628086729 655171584        0 655171584       60        0    34043
1691787264 628089196 655175680        0 655175680       60        0    34046
1691803648 628098840 655187968        0 655187968       59        0    34047
1691795456 628091503 655183872        0 655183872       60        0    34044
1691799552 628086877 655183872        0 655183872       60        0    34047

ORDER 3 zspage

1691803648 627792993 641794048        0 641794048       60        0    33591
1691787264 627779342 641708032        0 641708032       59        0    33591
1691811840 627786616 641769472        0 641769472       60        0    33591
1691803648 627794468 641818624        0 641818624       59        0    33592
1691783168 627780882 641794048        0 641794048       61        0    33591

ORDER 4 zspage

1691803648 627726635 639655936        0 639655936       60        0    33435
1691811840 627733348 639643648        0 639643648       61        0    33434
1691795456 627726290 639614976        0 639614976       60        0    33435
1691803648 627730458 639688704        0 639688704       60        0    33434
1691811840 627727771 639688704        0 639688704       60        0    33434

Order 3 and order 4 show statistically significant improvement in
`mem_used_max` metrics.

T-test for order 3:

x order-2-maxmem
+ order-3-maxmem
    N           Min           Max        Median           Avg        Stddev
x   5 6.5517158e+08 6.5518797e+08 6.5518387e+08  6.551806e+08     6730.4157
+   5 6.4170803e+08 6.4181862e+08 6.4179405e+08 6.4177684e+08     42210.666
Difference at 95.0% confidence
	-1.34038e+07 +/- 44080.7
	-2.04581% +/- 0.00672802%
	(Student's t, pooled s = 30224.5)

T-test for order 4:

x order-2-maxmem
+ order-4-maxmem
    N           Min           Max        Median           Avg        Stddev
x   5 6.5517158e+08 6.5518797e+08 6.5518387e+08  6.551806e+08     6730.4157
+   5 6.3961498e+08  6.396887e+08 6.3965594e+08 6.3965839e+08     31408.602
Difference at 95.0% confidence
	-1.55222e+07 +/- 33126.2
	-2.36915% +/- 0.00505604%
	(Student's t, pooled s = 22713.4)

This test tends to benefit more from order 4 zspages, due to the test's
data patterns.

zsmalloc object distribution analysis
=============================================================================

Order 2 (4 pages per zspage) tends to put many objects in size class 2048,
which is merged with size classes #112-#125:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
    71  1168           0            0          6146       6146       1756                2        0
    74  1216           0            1          4560       4552       1368                3        0
    76  1248           0            1          2938       2934        904                4        0
    83  1360           0            0         10971      10971       3657                1        0
    91  1488           0            0         16126      16126       5864                4        0
    94  1536           0            1          5912       5908       2217                3        0
   100  1632           0            0         11990      11990       4796                2        0
   107  1744           0            1         15771      15768       6759                3        0
   111  1808           0            1         10386      10380       4616                4        0
   126  2048           0            0         45444      45444      22722                1        0
   144  2336           0            0         47446      47446      27112                4        0
   151  2448           1            0         10760      10759       6456                3        0
   168  2720           0            0         10173      10173       6782                2        0
   190  3072           0            1          1700       1697       1275                3        0
   202  3264           0            1           290        286        232                4        0
   254  4096           0            0         34051      34051      34051                1        0

Order 3 (8 pages per zspage) changed pool characteristics and unmerged
some of the size classes, which resulted in fewer objects being put into
size class 2048, because lower size classes are now available for more
compact object storage:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
    71  1168           0            1          2996       2994        856                2        0
    72  1184           0            1          1632       1609        476                7        0
    73  1200           1            0          1445       1442        425                5        0
    74  1216           0            0          1510       1510        453                3        0
    75  1232           0            1          1495       1479        455                7        0
    76  1248           0            1          1456       1451        448                4        0
    78  1280           0            1          3040       3033        950                5        0
    79  1296           0            1          1584       1571        504                7        0
    83  1360           0            0          6375       6375       2125                1        0
    84  1376           0            1          1817       1796        632                8        0
    87  1424           0            1          6020       6006       2107                7        0
    88  1440           0            1          2108       2101        744                6        0
    89  1456           0            1          2072       2064        740                5        0
    91  1488           0            1          4169       4159       1516                4        0
    92  1504           0            1          2014       2007        742                7        0
    94  1536           0            1          3904       3900       1464                3        0
    95  1552           0            1          1890       1873        720                8        0
    96  1568           0            1          1963       1958        755                5        0
    97  1584           0            1          1980       1974        770                7        0
   100  1632           0            1          6190       6187       2476                2        0
   103  1680           0            0          6477       6477       2667                7        0
   104  1696           0            1          2256       2253        940                5        0
   105  1712           0            1          2356       2340        992                8        0
   107  1744           1            0          4697       4696       2013                3        0
   110  1792           0            1          7744       7734       3388                7        0
   111  1808           0            1          2655       2649       1180                4        0
   114  1856           0            1          8371       8365       3805                5        0
   116  1888           1            0          5863       5862       2706                6        0
   117  1904           0            1          2955       2942       1379                7        0
   118  1920           0            1          3009       2997       1416                8        0
   126  2048           0            0         25276      25276      12638                1        0
   128  2080           0            1          6060       6052       3232                8        0
   129  2096           1            0          3081       3080       1659                7        0
   134  2176           0            1         14835      14830       7912                8        0
   135  2192           0            1          2769       2758       1491                7        0
   137  2224           0            1          5082       5077       2772                6        0
   140  2272           0            1          7236       7232       4020                5        0
   144  2336           0            1          8428       8423       4816                4        0
   147  2384           0            1          5316       5313       3101                7        0
   151  2448           0            1          5445       5443       3267                3        0
   155  2512           0            0          4121       4121       2536                8        0
   158  2560           0            1          2208       2205       1380                5        0
   160  2592           0            0          1133       1133        721                7        0
   168  2720           0            0          2712       2712       1808                2        0
   177  2864           1            0          1100       1098        770                7        0
   180  2912           0            1           189        183        135                5        0
   184  2976           0            1           176        166        128                8        0
   190  3072           0            0           252        252        189                3        0
   197  3184           0            1           198        192        154                7        0
   202  3264           0            1           100         96         80                4        0
   211  3408           0            1           210        208        175                5        0
   217  3504           0            1            98         94         84                6        0
   222  3584           0            0           104        104         91                7        0
   225  3632           0            1            54         50         48                8        0
   254  4096           0            0         33591      33591      33591                1        0

Note, the huge size watermark is above 3632 and there are a number of new
normal classes available that previously were merged with the huge class.
For instance, size class #211 holds 210 objects of size 3408 and uses 175
physical pages, while previously for those objects we would have used 210
physical pages.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sergey Senozhatsky <[email protected]>
Cc: Alexey Romanov <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Nitin Gupta <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
jonhunter pushed a commit to jonhunter/linux that referenced this pull request Nov 8, 2022
zsmalloc has 255 size classes.  Size classes contain a number of zspages,
which store objects of the same size.  A zspage can consist of up to four
physical pages.  The exact (optimal) zspage size is calculated for each
size class during zsmalloc pool creation.

As a reasonable optimization, zsmalloc merges size classes that have
similar characteristics: the number of pages per zspage and the number of
objects a zspage can store.

For example, let's look at the following size classes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
..
   94  1536           0            0             0          0          0                3        0
  100  1632           0            0             0          0          0                2        0
..

Size classes #95-99 are merged with size class #100. That is, each time
we store an object of size, say, 1568 bytes, instead of using class #96
we end up storing it in size class #100. Class #100 is for objects of
1632 bytes in size, hence every 1568-byte object wastes 1632-1568 = 64
bytes. Class #100 zspages consist of 2 physical pages and can hold 5
objects. When we need to store, say, 13 objects of size 1568 we end up
allocating three zspages; in other words, 6 physical pages.

However, if we look closer at size class #96 (which should hold objects
of size 1568 bytes) and trace get_pages_per_zspage():

    pages per zspage      wasted bytes     used%
           1                  960           76
           2                  352           95
           3                 1312           89
           4                  704           95
           5                   96           99

We'd notice that the optimal zspage configuration for this class is
when it consists of 5 physical pages, but currently we never let zspages
consist of more than 4 pages. A 5 page class #96 configuration would
store 13 objects of size 1568 in a single zspage, allocating 5 physical
pages, as opposed to 6 physical pages that class #100 will allocate.
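
The trace above can be reproduced with a small sketch. It is modelled on the
arithmetic the table implies (waste is the zspage size modulo the class size)
and assumes 4096-byte pages; the helper names trace_pages_per_zspage() and
pages_for() are made up for illustration and are not kernel functions.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, as in the tables of this message


def trace_pages_per_zspage(class_size, max_pages):
    """Return (best_pages, rows); each row is (pages, wasted_bytes, used_percent)."""
    rows = []
    best_pages, best_used = 1, 0
    for pages in range(1, max_pages + 1):
        zspage_size = pages * PAGE_SIZE
        waste = zspage_size % class_size
        used_pc = (zspage_size - waste) * 100 // zspage_size
        rows.append((pages, waste, used_pc))
        # Assumption: ties keep the smaller zspage (strictly-greater comparison).
        if used_pc > best_used:
            best_pages, best_used = pages, used_pc
    return best_pages, rows


def pages_for(num_objects, class_size, pages_per_zspage):
    """Physical pages needed to store num_objects in zspages of the given shape."""
    objs_per_zspage = pages_per_zspage * PAGE_SIZE // class_size
    zspages = -(-num_objects // objs_per_zspage)  # ceiling division
    return zspages * pages_per_zspage


best, rows = trace_pages_per_zspage(1568, 5)
for pages, waste, used_pc in rows:
    print(f"{pages:>2} {waste:>6} {used_pc:>4}")
print("best:", best)
print("13 objects via class #100 (1632 bytes, 2 pages):", pages_for(13, 1632, 2))
print("13 objects via class #96 (1568 bytes, 5 pages):", pages_for(13, 1568, 5))
```

With a 4-page limit the same helper settles on 2 pages for a 1568-byte
class, which is why such objects currently end up sharing class #100.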

A higher order zspage for class #96 also changes its key characteristics:
pages per zspage and objects per zspage. As a result classes #96 and #100
are not merged anymore, which gives us a more compact zsmalloc.

Of course the described effect does not apply only to size classes #96
and #100. We still merge classes, but less often so. In other words
classes are grouped in a more compact way, which decreases memory wastage:

zspage order               # unique size classes
     2                                69
     3                               123
     4                               191

Let's take a closer look at the bottom of /sys/kernel/debug/zsmalloc/zram0/classes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  254  4096           0            0             0          0          0                1        0
...

For exactly the same reason - maximum 4 pages per zspage - the last non-huge
size class is #202, which stores objects of size 3264 bytes. Any object
larger than 3264 bytes, hence, is considered to be huge and lands in size
class #254, which uses a whole physical page to store every object. To put
it slightly differently - objects in huge classes don't share physical pages.

3264 bytes is too low of a watermark and we have too many huge classes:
classes from #203 to #254. Similarly to size class #96 above, higher order
zspages change key characteristics for some of those huge size classes and
thus those classes become normal classes, where stored objects share physical
pages.

Hence yet another consequence of higher order zspages: we move the huge
size class watermark with higher order zspages, have fewer huge classes and
store large objects in a more compact way.

For order 3, huge class watermark becomes 3632 bytes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  211  3408           0            0             0          0          0                5        0
  217  3504           0            0             0          0          0                6        0
  222  3584           0            0             0          0          0                7        0
  225  3632           0            0             0          0          0                8        0
  254  4096           0            0             0          0          0                1        0
...

For order 4, huge class watermark becomes 3840 bytes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  206  3328           0            0             0          0          0               13        0
  207  3344           0            0             0          0          0                9        0
  208  3360           0            0             0          0          0               14        0
  211  3408           0            0             0          0          0                5        0
  212  3424           0            0             0          0          0               16        0
  214  3456           0            0             0          0          0               11        0
  217  3504           0            0             0          0          0                6        0
  219  3536           0            0             0          0          0               13        0
  222  3584           0            0             0          0          0                7        0
  223  3600           0            0             0          0          0               15        0
  225  3632           0            0             0          0          0                8        0
  228  3680           0            0             0          0          0                9        0
  230  3712           0            0             0          0          0               10        0
  232  3744           0            0             0          0          0               11        0
  234  3776           0            0             0          0          0               12        0
  235  3792           0            0             0          0          0               13        0
  236  3808           0            0             0          0          0               14        0
  238  3840           0            0             0          0          0               15        0
  254  4096           0            0             0          0          0                1        0
...
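
The watermarks quoted above can be cross-checked with a short sketch. It
treats a class as huge when its chosen zspage stores exactly one object,
and it assumes 4096-byte pages and class sizes of 32 + 16 * index bytes
(inferred from the tables); huge_class_watermark() is a name invented for
this illustration.

```python
PAGE_SIZE = 4096      # assumption: 4 KiB pages
MIN_SIZE = 32         # assumption: class #0 size, inferred from the tables
DELTA = 16            # assumption: size step between adjacent classes
NUM_CLASSES = 255


def pages_per_zspage(class_size, max_pages):
    """Pick the zspage size (in pages) with the highest used percentage."""
    best_pages, best_used = 1, 0
    for pages in range(1, max_pages + 1):
        zspage_size = pages * PAGE_SIZE
        used_pc = (zspage_size - zspage_size % class_size) * 100 // zspage_size
        if used_pc > best_used:
            best_pages, best_used = pages, used_pc
    return best_pages


def huge_class_watermark(max_pages):
    """Largest class size whose zspage still holds more than one object."""
    for idx in range(NUM_CLASSES - 1, -1, -1):
        size = min(MIN_SIZE + idx * DELTA, PAGE_SIZE)
        pages = pages_per_zspage(size, max_pages)
        if pages * PAGE_SIZE // size > 1:
            return size
    return None


for order in (2, 3, 4):
    print(f"order {order}: watermark {huge_class_watermark(1 << order)} bytes")
```

For orders 2, 3 and 4 this prints 3264, 3632 and 3840 bytes, in line with
the tables above.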

TESTS
=====

Test untars linux-6.0.tar.xz and compiles the kernel.

zram is configured as a block device with ext4 file system, lzo-rle
compression algorithm. We captured /sys/block/zram0/mm_stat after
every test and rebooted the VM.

orig_data_size       mem_used_total     mem_used_max       pages_compacted
          compr_data_size         mem_limit         same_pages       huge_pages

ORDER 2 (BASE) zspage

1691791360 628086729 655171584        0 655171584       60        0    34043
1691787264 628089196 655175680        0 655175680       60        0    34046
1691803648 628098840 655187968        0 655187968       59        0    34047
1691795456 628091503 655183872        0 655183872       60        0    34044
1691799552 628086877 655183872        0 655183872       60        0    34047

ORDER 3 zspage

1691803648 627792993 641794048        0 641794048       60        0    33591
1691787264 627779342 641708032        0 641708032       59        0    33591
1691811840 627786616 641769472        0 641769472       60        0    33591
1691803648 627794468 641818624        0 641818624       59        0    33592
1691783168 627780882 641794048        0 641794048       61        0    33591

ORDER 4 zspage

1691803648 627726635 639655936        0 639655936       60        0    33435
1691811840 627733348 639643648        0 639643648       61        0    33434
1691795456 627726290 639614976        0 639614976       60        0    33435
1691803648 627730458 639688704        0 639688704       60        0    33434
1691811840 627727771 639688704        0 639688704       60        0    33434
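
For readers who want to process such samples, here is a minimal parsing
sketch. The field order follows the two-line header above; the MmStat
tuple and parse_mm_stat() are illustrative names, and any extra trailing
columns that other kernel versions may print are simply ignored.

```python
from collections import namedtuple

# Field order as given by the mm_stat header in this message.
MM_STAT_FIELDS = (
    "orig_data_size",
    "compr_data_size",
    "mem_used_total",
    "mem_limit",
    "mem_used_max",
    "same_pages",
    "pages_compacted",
    "huge_pages",
)
MmStat = namedtuple("MmStat", MM_STAT_FIELDS)


def parse_mm_stat(line):
    """Parse one whitespace-separated mm_stat line into named integer fields."""
    values = [int(v) for v in line.split()[:len(MM_STAT_FIELDS)]]
    return MmStat(*values)


# First ORDER 2 sample from the results above.
sample = "1691791360 628086729 655171584        0 655171584       60        0    34043"
stat = parse_mm_stat(sample)

# Bytes the allocator uses on top of the compressed payload.
overhead = stat.mem_used_total - stat.compr_data_size
print(stat.mem_used_max, stat.huge_pages, overhead)
```

On a live system the line would come from reading
/sys/block/zram0/mm_stat instead of the hard-coded sample.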

Order 3 and order 4 show statistically significant improvement in
`mem_used_max` metrics.

T-test for order 3:

x order-2-maxmem
+ order-3-maxmem
    N           Min           Max        Median           Avg        Stddev
x   5 6.5517158e+08 6.5518797e+08 6.5518387e+08  6.551806e+08     6730.4157
+   5 6.4170803e+08 6.4181862e+08 6.4179405e+08 6.4177684e+08     42210.666
Difference at 95.0% confidence
	-1.34038e+07 +/- 44080.7
	-2.04581% +/- 0.00672802%
	(Student's t, pooled s = 30224.5)

T-test for order 4:

x order-2-maxmem
+ order-4-maxmem
    N           Min           Max        Median           Avg        Stddev
x   5 6.5517158e+08 6.5518797e+08 6.5518387e+08  6.551806e+08     6730.4157
+   5 6.3961498e+08  6.396887e+08 6.3965594e+08 6.3965839e+08     31408.602
Difference at 95.0% confidence
	-1.55222e+07 +/- 33126.2
	-2.36915% +/- 0.00505604%
	(Student's t, pooled s = 22713.4)

This test tends to benefit more from order 4 zspages, due to the test's
data patterns.
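
The t-test figures above can be recomputed from the raw `mem_used_max`
samples with a short sketch. It assumes a pooled-variance Student's t
interval and hard-codes the two-sided 95% critical value for 8 degrees of
freedom (about 2.306), since the Python standard library has no
t-distribution; compare() is an illustrative helper, not the tool that
produced the output above.

```python
from math import sqrt
from statistics import mean, stdev

T_CRIT_DF8 = 2.306  # two-sided 95% Student's t quantile, 8 degrees of freedom

order2 = [655171584, 655175680, 655187968, 655183872, 655183872]
order3 = [641794048, 641708032, 641769472, 641818624, 641794048]
order4 = [639655936, 639643648, 639614976, 639688704, 639688704]


def compare(base, new):
    """Return (mean difference, 95% half-width, pooled stddev) for two samples."""
    n1, n2 = len(base), len(new)
    diff = mean(new) - mean(base)
    pooled = sqrt(((n1 - 1) * stdev(base) ** 2 + (n2 - 1) * stdev(new) ** 2)
                  / (n1 + n2 - 2))
    # The hard-coded quantile is only valid for n1 + n2 - 2 == 8.
    half_width = T_CRIT_DF8 * pooled * sqrt(1 / n1 + 1 / n2)
    return diff, half_width, pooled


for name, sample in (("order 3", order3), ("order 4", order4)):
    diff, half_width, pooled = compare(order2, sample)
    percent = 100 * diff / mean(order2)
    print(f"{name}: {diff:.6g} +/- {half_width:.6g} ({percent:.5f}%), pooled s = {pooled:.6g}")
```

The differences and pooled standard deviations should agree with the
reported values up to rounding.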

zsmalloc object distribution analysis
=============================================================================

Order 2 (4 pages per zspage) tends to put many objects in size class 2048,
which is merged with size classes #112-#125:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
    71  1168           0            0          6146       6146       1756                2        0
    74  1216           0            1          4560       4552       1368                3        0
    76  1248           0            1          2938       2934        904                4        0
    83  1360           0            0         10971      10971       3657                1        0
    91  1488           0            0         16126      16126       5864                4        0
    94  1536           0            1          5912       5908       2217                3        0
   100  1632           0            0         11990      11990       4796                2        0
   107  1744           0            1         15771      15768       6759                3        0
   111  1808           0            1         10386      10380       4616                4        0
   126  2048           0            0         45444      45444      22722                1        0
   144  2336           0            0         47446      47446      27112                4        0
   151  2448           1            0         10760      10759       6456                3        0
   168  2720           0            0         10173      10173       6782                2        0
   190  3072           0            1          1700       1697       1275                3        0
   202  3264           0            1           290        286        232                4        0
   254  4096           0            0         34051      34051      34051                1        0

Order 3 (8 pages per zspage) changed pool characteristics and unmerged
some of the size classes, which resulted in fewer objects being put into
size class 2048, because lower size classes are now available for more
compact object storage:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
    71  1168           0            1          2996       2994        856                2        0
    72  1184           0            1          1632       1609        476                7        0
    73  1200           1            0          1445       1442        425                5        0
    74  1216           0            0          1510       1510        453                3        0
    75  1232           0            1          1495       1479        455                7        0
    76  1248           0            1          1456       1451        448                4        0
    78  1280           0            1          3040       3033        950                5        0
    79  1296           0            1          1584       1571        504                7        0
    83  1360           0            0          6375       6375       2125                1        0
    84  1376           0            1          1817       1796        632                8        0
    87  1424           0            1          6020       6006       2107                7        0
    88  1440           0            1          2108       2101        744                6        0
    89  1456           0            1          2072       2064        740                5        0
    91  1488           0            1          4169       4159       1516                4        0
    92  1504           0            1          2014       2007        742                7        0
    94  1536           0            1          3904       3900       1464                3        0
    95  1552           0            1          1890       1873        720                8        0
    96  1568           0            1          1963       1958        755                5        0
    97  1584           0            1          1980       1974        770                7        0
   100  1632           0            1          6190       6187       2476                2        0
   103  1680           0            0          6477       6477       2667                7        0
   104  1696           0            1          2256       2253        940                5        0
   105  1712           0            1          2356       2340        992                8        0
   107  1744           1            0          4697       4696       2013                3        0
   110  1792           0            1          7744       7734       3388                7        0
   111  1808           0            1          2655       2649       1180                4        0
   114  1856           0            1          8371       8365       3805                5        0
   116  1888           1            0          5863       5862       2706                6        0
   117  1904           0            1          2955       2942       1379                7        0
   118  1920           0            1          3009       2997       1416                8        0
   126  2048           0            0         25276      25276      12638                1        0
   128  2080           0            1          6060       6052       3232                8        0
   129  2096           1            0          3081       3080       1659                7        0
   134  2176           0            1         14835      14830       7912                8        0
   135  2192           0            1          2769       2758       1491                7        0
   137  2224           0            1          5082       5077       2772                6        0
   140  2272           0            1          7236       7232       4020                5        0
   144  2336           0            1          8428       8423       4816                4        0
   147  2384           0            1          5316       5313       3101                7        0
   151  2448           0            1          5445       5443       3267                3        0
   155  2512           0            0          4121       4121       2536                8        0
   158  2560           0            1          2208       2205       1380                5        0
   160  2592           0            0          1133       1133        721                7        0
   168  2720           0            0          2712       2712       1808                2        0
   177  2864           1            0          1100       1098        770                7        0
   180  2912           0            1           189        183        135                5        0
   184  2976           0            1           176        166        128                8        0
   190  3072           0            0           252        252        189                3        0
   197  3184           0            1           198        192        154                7        0
   202  3264           0            1           100         96         80                4        0
   211  3408           0            1           210        208        175                5        0
   217  3504           0            1            98         94         84                6        0
   222  3584           0            0           104        104         91                7        0
   225  3632           0            1            54         50         48                8        0
   254  4096           0            0         33591      33591      33591                1        0

Note, the huge size watermark is above 3632 and there are a number of new
normal classes available that previously were merged with the huge class.
For instance, size class #211 holds 210 objects of size 3408 and uses 175
physical pages, while previously for those objects we would have used 210
physical pages.
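
The page counts in this comparison follow from the zspage geometry. A
minimal sketch, assuming 4096-byte pages and that pages_used counts whole
zspages; the helper pages_used() is named after the table column purely
for illustration:

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def pages_used(num_objects, class_size, pages_per_zspage):
    """Physical pages occupied when objects are packed into whole zspages."""
    objs_per_zspage = pages_per_zspage * PAGE_SIZE // class_size
    zspages = -(-num_objects // objs_per_zspage)  # ceiling division
    return zspages * pages_per_zspage


# Class #211 with order 3: 210 objects of 3408 bytes in 5-page zspages.
normal = pages_used(210, 3408, 5)
# The same objects in the huge class: one 4096-byte slot per physical page.
huge = pages_used(210, 4096, 1)
print(normal, huge)

# Cross-check against other rows of the tables above.
print(pages_used(45444, 2048, 1))   # class #126, order 2
print(pages_used(47446, 2336, 4))   # class #144, order 2
```

Each 5-page zspage holds 6 objects of 3408 bytes, so 210 objects need 35
zspages, or 175 pages, versus 210 pages in the huge class.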

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sergey Senozhatsky <[email protected]>
Cc: Alexey Romanov <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Nitin Gupta <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Nov 9, 2022
zsmalloc has 255 size classes.  Size classes contain a number of zspages,
which store objects of the same size.  A zspage can consist of up to four
physical pages.  The exact (optimal) zspage size is calculated for each
size class during zsmalloc pool creation.

As a reasonable optimization, zsmalloc merges size classes that have
similar characteristics: the number of pages per zspage and the number of
objects a zspage can store.

For example, let's look at the following size classes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
..
   94  1536           0            0             0          0          0                3        0
  100  1632           0            0             0          0          0                2        0
..

Size classes #95-99 are merged with size class #100. That is, each time
we store an object of size, say, 1568 bytes, instead of using class #96
we end up storing it in size class #100. Class #100 is for objects of
1632 bytes in size, hence every 1568-byte object wastes 1632-1568 = 64
bytes. Class #100 zspages consist of 2 physical pages and can hold 5
objects. When we need to store, say, 13 objects of size 1568 we end up
allocating three zspages; in other words, 6 physical pages.

However, if we look closer at size class #96 (which should hold objects
of size 1568 bytes) and trace get_pages_per_zspage():

    pages per zspage      wasted bytes     used%
           1                  960           76
           2                  352           95
           3                 1312           89
           4                  704           95
           5                   96           99

We'd notice that the optimal zspage configuration for this class is
when it consists of 5 physical pages, but currently we never let zspages
consist of more than 4 pages. A 5 page class #96 configuration would
store 13 objects of size 1568 in a single zspage, allocating 5 physical
pages, as opposed to 6 physical pages that class #100 will allocate.

A higher order zspage for class #96 also changes its key characteristics:
pages per zspage and objects per zspage. As a result classes #96 and #100
are not merged anymore, which gives us a more compact zsmalloc.

Of course the described effect does not apply only to size classes #96
and #100. We still merge classes, but less often so. In other words
classes are grouped in a more compact way, which decreases memory wastage:

zspage order               # unique size classes
     2                                69
     3                               123
     4                               191

Let's take a closer look at the bottom of /sys/kernel/debug/zsmalloc/zram0/classes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  254  4096           0            0             0          0          0                1        0
...

For exactly the same reason - maximum 4 pages per zspage - the last non-huge
size class is #202, which stores objects of size 3264 bytes. Any object
larger than 3264 bytes, hence, is considered to be huge and lands in size
class #254, which uses a whole physical page to store every object. To put
it slightly differently - objects in huge classes don't share physical pages.

3264 bytes is too low a watermark and we have too many huge classes:
classes from #203 to #254. Similarly to size class #96 above, higher order
zspages change key characteristics for some of those huge size classes and
thus those classes become normal classes, where stored objects share physical
pages.

Hence yet another consequence of higher order zspages: we move the huge
size class watermark with higher order zspages, have fewer huge classes and
store large objects in a more compact way.
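
The watermark shift can be modelled with a short sketch. It rests on stated
assumptions rather than on the kernel source: 4096-byte pages, class sizes
running from 32 to 4096 bytes in 16-byte steps (consistent with the class
tables in this message), and a class counting as huge when its best
configuration is 1 page holding 1 object. huge_watermark() is a hypothetical
helper:

```python
PAGE_SIZE = 4096     # assumption: 4K pages
MIN_CLASS_SIZE = 32  # assumption: smallest size class
CLASS_DELTA = 16     # assumption: step between neighbouring size classes


def zspage_config(class_size, max_pages):
    """Return (pages_per_zspage, objs_per_zspage) for a size class."""
    best_pages, best_used = 1, -1
    for pages in range(1, max_pages + 1):
        zspage_size = pages * PAGE_SIZE
        used_pc = (zspage_size - zspage_size % class_size) * 100 // zspage_size
        if used_pc > best_used:
            best_pages, best_used = pages, used_pc
    return best_pages, best_pages * PAGE_SIZE // class_size


def huge_watermark(max_pages):
    """Largest class size whose objects still share physical pages."""
    watermark = None
    for size in range(MIN_CLASS_SIZE, PAGE_SIZE + 1, CLASS_DELTA):
        if zspage_config(size, max_pages) != (1, 1):
            watermark = size
    return watermark


for order in (2, 3, 4):
    print(order, huge_watermark(1 << order))
```

Under these assumptions the sketch yields 3264 for order 2 and the 3632 and
3840 byte watermarks quoted in this message for orders 3 and 4.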

For order 3, huge class watermark becomes 3632 bytes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  211  3408           0            0             0          0          0                5        0
  217  3504           0            0             0          0          0                6        0
  222  3584           0            0             0          0          0                7        0
  225  3632           0            0             0          0          0                8        0
  254  4096           0            0             0          0          0                1        0
...

For order 4, huge class watermark becomes 3840 bytes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  206  3328           0            0             0          0          0               13        0
  207  3344           0            0             0          0          0                9        0
  208  3360           0            0             0          0          0               14        0
  211  3408           0            0             0          0          0                5        0
  212  3424           0            0             0          0          0               16        0
  214  3456           0            0             0          0          0               11        0
  217  3504           0            0             0          0          0                6        0
  219  3536           0            0             0          0          0               13        0
  222  3584           0            0             0          0          0                7        0
  223  3600           0            0             0          0          0               15        0
  225  3632           0            0             0          0          0                8        0
  228  3680           0            0             0          0          0                9        0
  230  3712           0            0             0          0          0               10        0
  232  3744           0            0             0          0          0               11        0
  234  3776           0            0             0          0          0               12        0
  235  3792           0            0             0          0          0               13        0
  236  3808           0            0             0          0          0               14        0
  238  3840           0            0             0          0          0               15        0
  254  4096           0            0             0          0          0                1        0
...

TESTS
=====

Test untars linux-6.0.tar.xz and compiles the kernel.

zram is configured as a block device with ext4 file system, lzo-rle
compression algorithm. We captured /sys/block/zram0/mm_stat after
every test and rebooted the VM.

orig_data_size       mem_used_total     mem_used_max       pages_compacted
          compr_data_size         mem_limit         same_pages       huge_pages

ORDER 2 (BASE) zspage

1691791360 628086729 655171584        0 655171584       60        0    34043
1691787264 628089196 655175680        0 655175680       60        0    34046
1691803648 628098840 655187968        0 655187968       59        0    34047
1691795456 628091503 655183872        0 655183872       60        0    34044
1691799552 628086877 655183872        0 655183872       60        0    34047

ORDER 3 zspage

1691803648 627792993 641794048        0 641794048       60        0    33591
1691787264 627779342 641708032        0 641708032       59        0    33591
1691811840 627786616 641769472        0 641769472       60        0    33591
1691803648 627794468 641818624        0 641818624       59        0    33592
1691783168 627780882 641794048        0 641794048       61        0    33591

ORDER 4 zspage

1691803648 627726635 639655936        0 639655936       60        0    33435
1691811840 627733348 639643648        0 639643648       61        0    33434
1691795456 627726290 639614976        0 639614976       60        0    33435
1691803648 627730458 639688704        0 639688704       60        0    33434
1691811840 627727771 639688704        0 639688704       60        0    33434

Order 3 and order 4 show a statistically significant improvement in the
`mem_used_max` metric.

T-test for order 3:

x order-2-maxmem
+ order-3-maxmem
    N           Min           Max        Median           Avg        Stddev
x   5 6.5517158e+08 6.5518797e+08 6.5518387e+08  6.551806e+08     6730.4157
+   5 6.4170803e+08 6.4181862e+08 6.4179405e+08 6.4177684e+08     42210.666
Difference at 95.0% confidence
	-1.34038e+07 +/- 44080.7
	-2.04581% +/- 0.00672802%
	(Student's t, pooled s = 30224.5)

T-test for order 4:

x order-2-maxmem
+ order-4-maxmem
    N           Min           Max        Median           Avg        Stddev
x   5 6.5517158e+08 6.5518797e+08 6.5518387e+08  6.551806e+08     6730.4157
+   5 6.3961498e+08  6.396887e+08 6.3965594e+08 6.3965839e+08     31408.602
Difference at 95.0% confidence
	-1.55222e+07 +/- 33126.2
	-2.36915% +/- 0.00505604%
	(Student's t, pooled s = 22713.4)
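
The order 3 comparison can be re-derived from the mem_used_max column of the
raw mm_stat lines above. This is a sketch, not the tool that produced the
output: it uses a pooled-variance Student's t interval, and the critical
value 2.306 (two-sided 95%, 8 degrees of freedom) is hard-coded as an
assumption instead of being computed:

```python
from math import sqrt
from statistics import mean, stdev

# mem_used_max values copied from the mm_stat lines above
order2 = [655171584, 655175680, 655187968, 655183872, 655183872]
order3 = [641794048, 641708032, 641769472, 641818624, 641794048]

T_CRIT = 2.306  # assumption: two-sided 95% critical value for 8 degrees of freedom


def compare(base, new, t_crit=T_CRIT):
    """Return (difference of means, CI half width, difference in percent)."""
    n1, n2 = len(base), len(new)
    pooled = sqrt(((n1 - 1) * stdev(base) ** 2 + (n2 - 1) * stdev(new) ** 2)
                  / (n1 + n2 - 2))
    diff = mean(new) - mean(base)
    half_width = t_crit * pooled * sqrt(1 / n1 + 1 / n2)
    return diff, half_width, diff * 100 / mean(base)


diff, half_width, diff_pc = compare(order2, order3)
print(f"{diff:.6g} +/- {half_width:.6g} ({diff_pc:.5f}%)")
```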

This test tends to benefit more from order 4 zspages, due to the test's data
patterns.

zsmalloc object distribution analysis
=============================================================================

Order 2 (4 pages per zspage) tends to put many objects in size class 2048,
which is merged with size classes #112-#125:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
    71  1168           0            0          6146       6146       1756                2        0
    74  1216           0            1          4560       4552       1368                3        0
    76  1248           0            1          2938       2934        904                4        0
    83  1360           0            0         10971      10971       3657                1        0
    91  1488           0            0         16126      16126       5864                4        0
    94  1536           0            1          5912       5908       2217                3        0
   100  1632           0            0         11990      11990       4796                2        0
   107  1744           0            1         15771      15768       6759                3        0
   111  1808           0            1         10386      10380       4616                4        0
   126  2048           0            0         45444      45444      22722                1        0
   144  2336           0            0         47446      47446      27112                4        0
   151  2448           1            0         10760      10759       6456                3        0
   168  2720           0            0         10173      10173       6782                2        0
   190  3072           0            1          1700       1697       1275                3        0
   202  3264           0            1           290        286        232                4        0
   254  4096           0            0         34051      34051      34051                1        0

Order 3 (8 pages per zspage) changed pool characteristics and unmerged
some of the size classes, which resulted in fewer objects being put into
size class 2048, because lower size classes are now available for more
compact object storage:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
    71  1168           0            1          2996       2994        856                2        0
    72  1184           0            1          1632       1609        476                7        0
    73  1200           1            0          1445       1442        425                5        0
    74  1216           0            0          1510       1510        453                3        0
    75  1232           0            1          1495       1479        455                7        0
    76  1248           0            1          1456       1451        448                4        0
    78  1280           0            1          3040       3033        950                5        0
    79  1296           0            1          1584       1571        504                7        0
    83  1360           0            0          6375       6375       2125                1        0
    84  1376           0            1          1817       1796        632                8        0
    87  1424           0            1          6020       6006       2107                7        0
    88  1440           0            1          2108       2101        744                6        0
    89  1456           0            1          2072       2064        740                5        0
    91  1488           0            1          4169       4159       1516                4        0
    92  1504           0            1          2014       2007        742                7        0
    94  1536           0            1          3904       3900       1464                3        0
    95  1552           0            1          1890       1873        720                8        0
    96  1568           0            1          1963       1958        755                5        0
    97  1584           0            1          1980       1974        770                7        0
   100  1632           0            1          6190       6187       2476                2        0
   103  1680           0            0          6477       6477       2667                7        0
   104  1696           0            1          2256       2253        940                5        0
   105  1712           0            1          2356       2340        992                8        0
   107  1744           1            0          4697       4696       2013                3        0
   110  1792           0            1          7744       7734       3388                7        0
   111  1808           0            1          2655       2649       1180                4        0
   114  1856           0            1          8371       8365       3805                5        0
   116  1888           1            0          5863       5862       2706                6        0
   117  1904           0            1          2955       2942       1379                7        0
   118  1920           0            1          3009       2997       1416                8        0
   126  2048           0            0         25276      25276      12638                1        0
   128  2080           0            1          6060       6052       3232                8        0
   129  2096           1            0          3081       3080       1659                7        0
   134  2176           0            1         14835      14830       7912                8        0
   135  2192           0            1          2769       2758       1491                7        0
   137  2224           0            1          5082       5077       2772                6        0
   140  2272           0            1          7236       7232       4020                5        0
   144  2336           0            1          8428       8423       4816                4        0
   147  2384           0            1          5316       5313       3101                7        0
   151  2448           0            1          5445       5443       3267                3        0
   155  2512           0            0          4121       4121       2536                8        0
   158  2560           0            1          2208       2205       1380                5        0
   160  2592           0            0          1133       1133        721                7        0
   168  2720           0            0          2712       2712       1808                2        0
   177  2864           1            0          1100       1098        770                7        0
   180  2912           0            1           189        183        135                5        0
   184  2976           0            1           176        166        128                8        0
   190  3072           0            0           252        252        189                3        0
   197  3184           0            1           198        192        154                7        0
   202  3264           0            1           100         96         80                4        0
   211  3408           0            1           210        208        175                5        0
   217  3504           0            1            98         94         84                6        0
   222  3584           0            0           104        104         91                7        0
   225  3632           0            1            54         50         48                8        0
   254  4096           0            0         33591      33591      33591                1        0

Note, the huge size watermark is above 3632 and there are a number of new
normal classes available that previously were merged with the huge class.
For instance, size class #211 holds 210 objects of size 3408 and uses 175
physical pages, while previously for those objects we would have used 210
physical pages.
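
The same comparison can be made for every new normal class in the order 3
table. The sketch below assumes 4096-byte pages and that a huge class spends
one full page per object; pages_used() is an illustrative helper and the rows
are copied from the table above:

```python
PAGE_SIZE = 4096  # assumption: 4K pages

# (class, size, obj_allocated, pages_per_zspage) from the order 3 table above
NEW_NORMAL_CLASSES = [
    (211, 3408, 210, 5),
    (217, 3504, 98, 6),
    (222, 3584, 104, 7),
    (225, 3632, 54, 8),
]


def pages_used(size, obj_allocated, pages_per_zspage):
    """Physical pages needed when objects share multi-page zspages."""
    objs_per_zspage = pages_per_zspage * PAGE_SIZE // size
    nr_zspages = -(-obj_allocated // objs_per_zspage)  # ceiling division
    return nr_zspages * pages_per_zspage


for cls, size, objs, pages in NEW_NORMAL_CLASSES:
    # as a huge class, every object would occupy its own physical page
    print(f"class {cls}: {pages_used(size, objs, pages)} pages instead of {objs}")
```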

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sergey Senozhatsky <[email protected]>
Cc: Alexey Romanov <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Nitin Gupta <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Mar 17, 2023
Currently, test_progs outputs all stdout/stderr as it runs, and when it
is done, prints a summary.

It is non-trivial for tooling to parse that output and extract meaningful
information from it.

This change adds a new option, `--json-summary`/`-J`, that lets the caller
specify a file where `test_progs{,-no_alu32}` can write a summary of the
run in JSON format that can later be parsed by tooling.

Currently, it creates a summary section with successes/skipped/failures
followed by a list of failed tests and subtests.

A test contains the following fields:
- name: the name of the test
- number: the number of the test
- message: the log message that was printed by the test.
- failed: A boolean indicating whether the test failed or not. Currently
we only output failed tests, but in the future, successful tests could
be added.
- subtests: A list of subtests associated with this test.

A subtest contains the following fields:
- name: same as above
- number: same as above
- message: the log message that was printed by the subtest.
- failed: same as above but for the subtest

An example run and json content below:
```
$ sudo ./test_progs -a $(grep -v '^#' ./DENYLIST.aarch64 | awk '{print $1","}' | tr -d '\n') -j -J /tmp/test_progs.json
$ jq < /tmp/test_progs.json | head -n 30
{
  "success": 29,
  "success_subtest": 23,
  "skipped": 3,
  "failed": 28,
  "results": [
    {
      "name": "bpf_cookie",
      "number": 10,
      "message": "test_bpf_cookie:PASS:skel_open 0 nsec\n",
      "failed": true,
      "subtests": [
        {
          "name": "multi_kprobe_link_api",
          "number": 2,
          "message": "kprobe_multi_link_api_subtest:PASS:load_kallsyms 0 nsec\nlibbpf: extern 'bpf_testmod_fentry_test1' (strong): not resolved\nlibbpf: failed to load object 'kprobe_multi'\nlibbpf: failed to load BPF skeleton 'kprobe_multi': -3\nkprobe_multi_link_api_subtest:FAIL:fentry_raw_skel_load unexpected error: -3\n",
          "failed": true
        },
        {
          "name": "multi_kprobe_attach_api",
          "number": 3,
          "message": "libbpf: extern 'bpf_testmod_fentry_test1' (strong): not resolved\nlibbpf: failed to load object 'kprobe_multi'\nlibbpf: failed to load BPF skeleton 'kprobe_multi': -3\nkprobe_multi_attach_api_subtest:FAIL:fentry_raw_skel_load unexpected error: -3\n",
          "failed": true
        },
        {
          "name": "lsm",
          "number": 8,
          "message": "lsm_subtest:PASS:lsm.link_create 0 nsec\nlsm_subtest:FAIL:stack_mprotect unexpected stack_mprotect: actual 0 != expected -1\n",
          "failed": true
        }
```

The file can then be used to print a summary of the test run and list of
failing tests/subtests:

```
$ jq -r < /tmp/test_progs.json '"Success: \(.success)/\(.success_subtest), Skipped: \(.skipped), Failed: \(.failed)"'

Success: 29/23, Skipped: 3, Failed: 28
$ jq -r < /tmp/test_progs.json '.results | map([
    if .failed then "#\(.number) \(.name)" else empty end,
    (
        . as {name: $tname, number: $tnum} | .subtests | map(
            if .failed then "#\($tnum)/\(.number) \($tname)/\(.name)" else empty end
        )
    )
]) | flatten | .[]' | head -n 20
 #10 bpf_cookie
 #10/2 bpf_cookie/multi_kprobe_link_api
 #10/3 bpf_cookie/multi_kprobe_attach_api
 #10/8 bpf_cookie/lsm
 #15 bpf_mod_race
 #15/1 bpf_mod_race/ksym (used_btfs UAF)
 #15/2 bpf_mod_race/kfunc (kfunc_btf_tab UAF)
 #36 cgroup_hierarchical_stats
 #61 deny_namespace
 #61/1 deny_namespace/unpriv_userns_create_no_bpf
 #73 fexit_stress
 #83 get_func_ip_test
 #99 kfunc_dynptr_param
 #99/1 kfunc_dynptr_param/dynptr_data_null
 #99/4 kfunc_dynptr_param/dynptr_data_null
 #100 kprobe_multi_bench_attach
 #100/1 kprobe_multi_bench_attach/kernel
 #100/2 kprobe_multi_bench_attach/modules
 #101 kprobe_multi_test
 #101/1 kprobe_multi_test/skel_api
```
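
For tooling that would rather not depend on jq, the same summary can be read
with a few lines of Python. This is a sketch under assumptions: it relies only
on the fields documented above, and the embedded sample is a trimmed copy of
the example output with the messages shortened to "...". The helper names are
made up for illustration:

```python
import json

# Trimmed sample in the format shown above; messages are shortened.
SAMPLE = """
{
  "success": 29,
  "success_subtest": 23,
  "skipped": 3,
  "failed": 28,
  "results": [
    {
      "name": "bpf_cookie",
      "number": 10,
      "message": "...",
      "failed": true,
      "subtests": [
        {"name": "multi_kprobe_link_api", "number": 2, "message": "...", "failed": true},
        {"name": "lsm", "number": 8, "message": "...", "failed": true}
      ]
    }
  ]
}
"""


def summary_line(summary):
    """One-line overview of the run."""
    return (f"Success: {summary['success']}/{summary['success_subtest']}, "
            f"Skipped: {summary['skipped']}, Failed: {summary['failed']}")


def failed_entries(summary):
    """List failed tests and subtests as '#N name' and '#N/M name/subname'."""
    lines = []
    for test in summary["results"]:
        if test["failed"]:
            lines.append(f"#{test['number']} {test['name']}")
        for sub in test.get("subtests", []):
            if sub["failed"]:
                lines.append(f"#{test['number']}/{sub['number']} "
                             f"{test['name']}/{sub['name']}")
    return lines


summary = json.loads(SAMPLE)
print(summary_line(summary))
print("\n".join(failed_entries(summary)))
```

For a real run, the sample string would be replaced by the contents of the
file passed to `-J`, for example `/tmp/test_progs.json`.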

Signed-off-by: Manu Bretelle <[email protected]>
ammarfaizi2 pushed a commit to ammarfaizi2/linux-fork that referenced this pull request Mar 17, 2023
Currently, test_progs outputs all stdout/stderr as it runs, and when it
is done, prints a summary.

It is non-trivial for tooling to parse that output and extract meaningful
information from it.

This change adds a new option, `--json-summary`/`-J`, that lets the caller
specify a file where `test_progs{,-no_alu32}` can write a summary of the
run in JSON format that can later be parsed by tooling.

Currently, it creates a summary section with successes/skipped/failures
followed by a list of failed tests and subtests.

A test contains the following fields:
- name: the name of the test
- number: the number of the test
- message: the log message that was printed by the test.
- failed: A boolean indicating whether the test failed or not. Currently
we only output failed tests, but in the future, successful tests could
be added.
- subtests: A list of subtests associated with this test.

A subtest contains the following fields:
- name: same as above
- number: same as above
- message: the log message that was printed by the subtest.
- failed: same as above but for the subtest

An example run and json content below:
```
$ sudo ./test_progs -a $(grep -v '^#' ./DENYLIST.aarch64 | awk '{print $1","}' | tr -d '\n') -j -J /tmp/test_progs.json
$ jq < /tmp/test_progs.json | head -n 30
{
  "success": 29,
  "success_subtest": 23,
  "skipped": 3,
  "failed": 28,
  "results": [
    {
      "name": "bpf_cookie",
      "number": 10,
      "message": "test_bpf_cookie:PASS:skel_open 0 nsec\n",
      "failed": true,
      "subtests": [
        {
          "name": "multi_kprobe_link_api",
          "number": 2,
          "message": "kprobe_multi_link_api_subtest:PASS:load_kallsyms 0 nsec\nlibbpf: extern 'bpf_testmod_fentry_test1' (strong): not resolved\nlibbpf: failed to load object 'kprobe_multi'\nlibbpf: failed to load BPF skeleton 'kprobe_multi': -3\nkprobe_multi_link_api_subtest:FAIL:fentry_raw_skel_load unexpected error: -3\n",
          "failed": true
        },
        {
          "name": "multi_kprobe_attach_api",
          "number": 3,
          "message": "libbpf: extern 'bpf_testmod_fentry_test1' (strong): not resolved\nlibbpf: failed to load object 'kprobe_multi'\nlibbpf: failed to load BPF skeleton 'kprobe_multi': -3\nkprobe_multi_attach_api_subtest:FAIL:fentry_raw_skel_load unexpected error: -3\n",
          "failed": true
        },
        {
          "name": "lsm",
          "number": 8,
          "message": "lsm_subtest:PASS:lsm.link_create 0 nsec\nlsm_subtest:FAIL:stack_mprotect unexpected stack_mprotect: actual 0 != expected -1\n",
          "failed": true
        }
```

The file can then be used to print a summary of the test run and list of
failing tests/subtests:

```
$ jq -r < /tmp/test_progs.json '"Success: \(.success)/\(.success_subtest), Skipped: \(.skipped), Failed: \(.failed)"'

Success: 29/23, Skipped: 3, Failed: 28
$ jq -r < /tmp/test_progs.json '.results | map([
    if .failed then "#\(.number) \(.name)" else empty end,
    (
        . as {name: $tname, number: $tnum} | .subtests | map(
            if .failed then "#\($tnum)/\(.number) \($tname)/\(.name)" else empty end
        )
    )
]) | flatten | .[]' | head -n 20
 #10 bpf_cookie
 #10/2 bpf_cookie/multi_kprobe_link_api
 #10/3 bpf_cookie/multi_kprobe_attach_api
 #10/8 bpf_cookie/lsm
 #15 bpf_mod_race
 #15/1 bpf_mod_race/ksym (used_btfs UAF)
 #15/2 bpf_mod_race/kfunc (kfunc_btf_tab UAF)
 #36 cgroup_hierarchical_stats
 #61 deny_namespace
 #61/1 deny_namespace/unpriv_userns_create_no_bpf
 #73 fexit_stress
 #83 get_func_ip_test
 #99 kfunc_dynptr_param
 #99/1 kfunc_dynptr_param/dynptr_data_null
 #99/4 kfunc_dynptr_param/dynptr_data_null
 #100 kprobe_multi_bench_attach
 #100/1 kprobe_multi_bench_attach/kernel
 #100/2 kprobe_multi_bench_attach/modules
 #101 kprobe_multi_test
 #101/1 kprobe_multi_test/skel_api
```

Signed-off-by: Manu Bretelle <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
logic10492 pushed a commit to logic10492/linux-amd-zen2 that referenced this pull request Jan 18, 2024
Fix and update semantics for ops.enable() and ops.disable()