
Possible memory leak "ExAllocatePoolWithTag failed" in spl-seg_kmem.c, line 134 #283

Closed · EchterAgo opened this issue Oct 11, 2023 · 192 comments

@EchterAgo

When testing #281 I noticed that when copying a 5TB dataset using rclone it always ends in an allocation failure:

*** Assertion failed: 0
***   Source File: H:\dev\openzfs\module\os\windows\spl\spl-seg_kmem.c, line 134

So ExAllocatePoolWithTag failed.

memory.txt

[screenshot]

This seems like a new issue because I was still able to copy the full dataset not that long ago.

I'll try to get some kstat information. Is it possible to get kstat info from the debugger when the failure has already happened? I could also try logging it periodically to a file.

@EchterAgo

I also just noticed from the time in the screenshot that this happened shortly after I started the copy.

@lundman

lundman commented Oct 11, 2023

With kstat, you can just run it after some use, say 2-3TB of data. Since nothing should "grow forever" inside kmem, leaks tend to stand out quite a bit.

If it's not an internal leak, it could be that we aren't releasing something between us and Windows, like when closing files or whatnot.

@EchterAgo

From the screenshot it looks like it takes about 2-3 minutes for memory usage to rise enough to cause the out-of-memory condition, so I'd have to watch it all the time to catch it. I made a script now that logs kstat every 10 seconds so I can see what the last file is.

@lundman

lundman commented Oct 12, 2023

I ran CrystalDisk.exe and then grep inuse kstat.txt, but it's pretty inconclusive. The biggest is of course the "total allocated from Windows".

@andrewc12

From what I could find, CrystalDiskMark is a wrapper around DiskSpd:

https://github.com/ayavilevich/DiskSpdAuto

You might want to look into this if you're planning on making some sort of test.

@lundman

lundman commented Oct 12, 2023

I am not, just checking if I could generate enough IO to show any leaks :)

@EchterAgo

Of course, when I try to reproduce the issue, it doesn't happen :\ It ran all night, although a bit slowly.

@lundman

lundman commented Oct 12, 2023

OK, so this Source File: H:\dev\openzfs\module\os\windows\spl\spl-seg_kmem.c, line 134 isn't necessarily a problem; I have an ASSERT in there because I wanted to see when it happens. Generally, all zfs/kmem allocations come into this function, where we allocate a large chunk of memory; kmem then carves it up into magazines etc. and dishes it out internally to ZFS.
We are supposed to detect memory pressure, and when there is pressure, the kmem reaper will walk around and release memory back to Windows.
So if the pressure detection is poor, or perhaps a bit slow, then we can be too slow in reaping memory back and get NULL allocations. kmem can handle NULL allocations, so eventually the ASSERT is to be removed.
You can see the same path for macOS here:
https://github.com/openzfsonosx/openzfs-fork/blob/macOS/module/os/macos/spl/spl-seg_kmem.c#L207

Occasionally getting NULL isn't indicative of a problem on its own, unless it happens a lot or quickly. If it keeps happening, and reaping never releases enough memory, then we probably do have a leak.
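
To make that path concrete, here is a minimal sketch of the allocation entry point, assuming only the names visible in this thread (osif_malloc, ExAllocatePoolWithTag, the memusage counter); the pool tag and the exact bookkeeping are illustrative placeholders, not the actual spl-seg_kmem.c code:

/*
 * Hedged sketch of the allocation path described above; the pool tag
 * and counter name are placeholders, not the real implementation.
 */
#include <ntddk.h>

static volatile LONG64 total_memory_allocated;	/* the "memusage" counter */

void *
osif_malloc(SIZE_T size)
{
	/* All zfs/kmem allocations funnel into one pool allocation. */
	void *buf = ExAllocatePoolWithTag(NonPagedPoolNx, size, 'mkgs');

	if (buf == NULL) {
		/*
		 * This is where the ASSERT fired (line 134). kmem itself
		 * copes with a NULL return: vmem signals pressure, the
		 * reaper hands memory back to Windows, and the caller
		 * can retry.
		 */
		return (NULL);
	}
	InterlockedAdd64(&total_memory_allocated, (LONG64)size);
	return (buf);
}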

The Windows memory pressure work is here:

https://github.com/openzfsonwindows/openzfs/blob/windows/module/os/windows/spl/spl-kmem.c#L4887-L4931

You can also issue pressure events with kstat; I've not tried it on Windows, but the code hasn't changed.
This is all code that needs to be tweaked to good defaults once things are stable enough for it to matter :) which I guess we are close to now.

@EchterAgo

OK, I reproduced the condition again, and I managed to stop the debugger just before it ran out of memory. The copy actually stalled completely. I'll try to get what I can out of the debugger and then restart the machine to get the kstat logs.

[screenshot]

@EchterAgo

EchterAgo commented Oct 12, 2023

cbuf.txt looks interesting

@EchterAgo

> OK, so this Source File: H:\dev\openzfs\module\os\windows\spl\spl-seg_kmem.c, line 134 isn't necessarily a problem; I have an ASSERT in there because I wanted to see when it happens. […]

What makes me think this is an issue is that it is definitely new behaviour. I never encountered this across a lot of copy runs before; now I encounter it almost every time.

@EchterAgo

There are a bunch of stacks like this:

2a88.001b78  ffff953882193080 0000001 Blocked    nt!KiSwapContext+0x76
                                        nt!KiSwapThread+0x500
                                        nt!KiCommitThreadWait+0x14f
                                        nt!KeWaitForMultipleObjects+0x2be
                                        OpenZFS!cv_timedwait_hires+0x1a0
                                        OpenZFS!vmem_bucket_alloc+0x703
                                        OpenZFS!vmem_xalloc+0xe51
                                        OpenZFS!vmem_alloc_impl+0x3cf
                                        OpenZFS!vmem_xalloc+0xe51
                                        OpenZFS!vmem_alloc_impl+0x3cf
                                        OpenZFS!vmem_xalloc+0xe51
                                        OpenZFS!vmem_alloc_impl+0x3cf
                                        OpenZFS!kmem_slab_create+0x12d
                                        OpenZFS!kmem_slab_alloc+0xae
                                        OpenZFS!kmem_cache_alloc+0x437
                                        OpenZFS!vmem_alloc_impl+0x10b
                                        OpenZFS!vmem_xalloc+0xe51
                                        OpenZFS!vmem_alloc_impl+0x3cf
                                        OpenZFS!kmem_slab_create+0x12d
                                        OpenZFS!kmem_slab_alloc+0xae
                                        OpenZFS!kmem_cache_alloc+0x437
                                        OpenZFS!zio_data_buf_alloc+0xa6
                                        OpenZFS!zil_itx_create+0x3f
                                        OpenZFS!zfs_log_write+0x32c
                                        OpenZFS!zfs_write+0x16fb
                                        OpenZFS!zfs_write_wrap+0xefb
                                        OpenZFS!fs_write_impl+0x364
                                        OpenZFS!fs_write+0x47e
                                        OpenZFS!fsDispatcher+0x187b
                                        OpenZFS!dispatcher+0x292
                                        nt!IofCallDriver+0x55
                                        +0xfffff8061125710f
                                        +0x2

stacks.txt

@EchterAgo

!locks for this thread:

Resource @ 0xffff8a8e139f4c40    Exclusively owned
     Threads: ffff953882193080-01<*> 

Resource @ 0xffff8a8e139f4ca8    Shared 1 owning threads
     Threads: ffff953882193080-01<*> 
KD: Scanning for held locks..

Resource @ 0xffff8a8e139f3a08    Exclusively owned
     Threads: ffff95387fbc8080-01<*> 

Resource @ 0xffff8a8e139f3a70    Shared 1 owning threads
     Threads: ffff95387fbc8080-01<*> 
KD: Scanning for held locks.

Resource @ 0xffff8a8e139f31c8    Exclusively owned
     Threads: ffff9538823e1080-01<*> 

Resource @ 0xffff8a8e139f3230    Shared 1 owning threads
     Threads: ffff9538823e1080-01<*> 
KD: Scanning for held locks.

Resource @ 0xffff8a8e139f2988    Exclusively owned
     Threads: ffff9538823ce080-01<*> 

Resource @ 0xffff8a8e139f29f0    Shared 1 owning threads
     Threads: ffff9538823ce080-01<*> 
109192 total locks, 8 locks currently held

@lundman

lundman commented Oct 12, 2023

Above comment was about the ASSERT itself, which I should not have left in there :)

Yeah, it is not doing well there. kmem seems quite wedged, which is shown in the stack. cbuf is most peculiar. More aggressive reaction to memory pressure would be nice, I think.

@EchterAgo

Can I get any more info from this state? It is very close to dying; I saw memory usage get very close to the maximum. I think it will be difficult to catch it there again without some automation. In any case, I captured a memory dump.

@EchterAgo

EchterAgo commented Oct 13, 2023

I let it run for a few more seconds; memory usage just keeps rapidly increasing, and rclone is completely stalled, I can't even terminate the process. I think I'll kill this now and get the kstat logs. I'll try to make a tool I can use to automate breaking when free memory gets low.

Edit: I also checked cbuf again, just more of the same.
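
A watcher for that can be tiny; here is a hedged user-mode sketch that polls free physical memory with GlobalMemoryStatusEx and trips when it drops below a low-water mark. The 512MB threshold and the reaction are arbitrary placeholders, not the tool itself:

/*
 * Minimal sketch of a low-memory tripwire for test automation.
 * The threshold and the reaction (print and exit) are placeholders;
 * a real harness might break into the kernel debugger or start a
 * kstat capture instead.
 */
#include <windows.h>
#include <stdio.h>

int
main(void)
{
	MEMORYSTATUSEX ms;

	for (;;) {
		ms.dwLength = sizeof (ms);
		if (GlobalMemoryStatusEx(&ms) &&
		    ms.ullAvailPhys < 512ULL * 1024 * 1024) {
			/* Low-water mark hit: react here. */
			printf("low memory: %llu bytes free\n",
			    (unsigned long long)ms.ullAvailPhys);
			return (1);
		}
		Sleep(1000);	/* poll once per second */
	}
}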

@EchterAgo

EchterAgo commented Oct 13, 2023

The last kstat output: output_195013.log

kstat logs taken every 10 seconds: kstat_logs.zip

Edit: Next time I need to do this I should log these to a network drive.

@lundman

lundman commented Oct 13, 2023

# grep inuse output_195013.log | sort -k2n

        mem_inuse                       772014080
        mem_inuse                       4348411904
        mem_inuse                       4849979392
        mem_inuse                       4850667520
        mem_inuse                       5224484864
        mem_inuse                       5224484864

So the biggest magazines are:

name:   spl_default_arena               class:    vmem
        mem_inuse                       772014080

name:   bucket_32768                    class:    vmem
        mem_inuse                       4348411904

name:   bucket_heap                     class:    vmem
        mem_inuse                       5224484864

name:   heap                            class:    vmem
        mem_inuse                       5224484864

name:   kmem_metadata                   class:    vmem
        mem_inuse                       363937792

name:   kmem_va                         class:    vmem
        mem_inuse                       4850667520

name:   kmem_default                    class:    vmem
        mem_inuse                       4849979392

Most of those are as expected, but bucket_32768 is weirdly larger than all the others at ~4GB; the next largest is bucket_1048576 at 280MB.

@EchterAgo

That bucket also shows up in cbuf a lot:

FFFF9538823E1080: SPL: vmem_freelist_insert_sort_by_time: at marker (bucket_32768)(steps: 0) p->vs_start, end == 0, 65536
FFFF9538823E1080: SPL: vmem_freelist_insert_sort_by_time: at marker (bucket_32768)(steps: 0) p->vs_start, end == 0, 32768
FFFF9538823E1080: SPL: vmem_freelist_insert_sort_by_time: at marker (bucket_heap)(steps: 0) p->vs_start, end == 0, 32768
FFFF95387FBC8080: SPL: vmem_freelist_insert_sort_by_time: at marker (bucket_heap)(steps: 0) p->vs_start, end == 0, 32768
FFFF9538823CE080: SPL: vmem_freelist_insert_sort_by_time: at marker (bucket_32768)(steps: 0) p->vs_start, end == 0, 65536
FFFF9538823CE080: SPL: vmem_freelist_insert_sort_by_time: at marker (bucket_32768)(steps: 0) p->vs_start, end == 0, 32768
FFFF9538823CE080: SPL: vmem_freelist_insert_sort_by_time: at marker (bucket_heap)(steps: 0) p->vs_start, end == 0, 32768
FFFF9538823E1080: SPL: vmem_freelist_insert_sort_by_time: at marker (bucket_heap)(steps: 0) p->vs_start, end == 0, 32768
FFFF95387FBC8080: SPL: vmem_freelist_insert_sort_by_time: at marker (bucket_32768)(steps: 0) p->vs_start, end == 0, 65536
FFFF95387FBC8080: SPL: vmem_freelist_insert_sort_by_time: at marker (bucket_32768)(steps: 0) p->vs_start, end == 0, 32768
FFFF95387FBC8080: SPL: vmem_freelist_insert_sort_by_time: at marker (bucket_heap)(steps: 0) p->vs_start, end == 0, 32768
FFFF953882193080: SPL: vmem_freelist_insert_sort_by_time: at marker (bucket_heap)(steps: 0) p->vs_start, end == 0, 32768
FFFF9538823E1080: SPL: vmem_freelist_insert_sort_by_time: at marker (bucket_32768)(steps: 0) p->vs_start, end == 0, 65536
FFFF9538823E1080: SPL: vmem_freelist_insert_sort_by_time: at marker (bucket_32768)(steps: 0) p->vs_start, end == 0, 32768
FFFF9538823E1080: SPL: vmem_freelist_insert_sort_by_time: at marker (bucket_heap)(steps: 0) p->vs_start, end == 0, 32768
FFFF9538823CE080: SPL: vmem_freelist_insert_sort_by_time: at marker (bucket_heap)(steps: 0) p->vs_start, end == 0, 32768
FFFF953882193080: SPL: vmem_freelist_insert_sort_by_time: at marker (bucket_32768)(steps: 0) p->vs_start, end == 0, 65536
FFFF953882193080: SPL: vmem_freelist_insert_sort_by_time: at marker (bucket_32768)(steps: 0) p->vs_start, end == 0, 32768
FFFF953882193080: SPL: vmem_freelist_insert_sort_by_time: at marker (bucket_heap)(steps: 0) p->vs_start, end == 0, 32768
FFFF9538823E1080: SPL: vmem_freelist_insert_sort_by_time: at marker (bucket_heap)(steps: 0) p->vs_start, end == 0, 32768
FFFF9538823CE080: SPL: vmem_freelist_insert_sort_by_time: at marker (bucket_32768)(steps: 0) p->vs_start, end == 0, 65536
FFFF9538823CE080: SPL: vmem_freelist_insert_sort_by_time: at marker (bucket_32768)(steps: 0) p->vs_start, end == 0, 32768
FFFF9538823CE080: SPL: vmem_freelist_insert_sort_by_time: at marker (bucket_heap)(steps: 0) p->vs_start, end == 0, 32768
FFFF95387FBC8080: SPL: vmem_freelist_insert_sort_by_time: at marker (bucket_heap)(steps: 0) p->vs_start, end == 0, 32768
FFFF95387FBC8080: SPL: vmem_freelist_insert_sort_by_time: at marker (bucket_32768)(steps: 0) p->vs_start, end == 0, 65536
FFFF95387FBC8080: SPL: vmem_freelist_insert_sort_by_time: at marker (bucket_32768)(steps: 0) p->vs_start, end == 0, 32768

@EchterAgo

What I don't understand is that memory usage is normally very stable: the VM has 16GB of memory, and there is normally around 4GB available during the test. But at some point memory usage rises above that, and by then I already know it will crash.

@EchterAgo

It just happened again, this time very fast. The horizontal axis is one minute in total, so after staying stable for 54 minutes, memory usage suddenly and smoothly rose by 4 gigabytes in under half a minute, and then it crashed.

[screenshot: memory usage graph]

@EchterAgo

I tried the recent change that removes the assert; now it just sits there for a while, seemingly unable to recover, and then the debugger breaks like this:

Without a debugger attached, the following bugcheck would have occurred.
  eb 0000000000011C94 0000000000000000 1 c0000054
Without a debugger attached, the following bugcheck would have occurred.
  eb 0000000000011C94 0000000000000000 1 c0000054
Without a debugger attached, the following bugcheck would have occurred.
  eb 0000000000011C94 0000000000000000 1 c0000054
Without a debugger attached, the following bugcheck would have occurred.
  eb 0000000000011C94 0000000000000000 1 c0000054
Without a debugger attached, the following bugcheck would have occurred.
  eb 0000000000011C94 0000000000000000 1 c0000054
Without a debugger attached, the following bugcheck would have occurred.
  eb 0000000000011C94 0000000000000000 1 c0000054
Without a debugger attached, the following bugcheck would have occurred.
  eb 0000000000011C94 0000000000000000 1 c0000054
Without a debugger attached, the following bugcheck would have occurred.
  eb 0000000000011C94 0000000000000000 1 c0000054
Without a debugger attached, the following bugcheck would have occurred.
  eb 0000000000011C94 0000000000000000 1 c0000054
Without a debugger attached, the following bugcheck would have occurred.
  eb 0000000000011C94 0000000000000000 1 c0000054
Without a debugger attached, the following bugcheck would have occurred.
  eb 0000000000011C94 0000000000000000 1 c0000054
Break instruction exception - code 80000003 (first chance)
nt!MiNoPagesLastChance+0x1b8:
fffff807`76350b70 cc              int     3

@EchterAgo

4: kd> k
 # Child-SP          RetAddr               Call Site
00 ffff9303`e82756c0 fffff807`7635c575     nt!MiNoPagesLastChance+0x1b8
01 ffff9303`e8275790 fffff807`7622d213     nt!MiWaitForFreePage+0x189
02 ffff9303`e8275890 fffff807`7620d1d8     nt!MmAccessFault+0x1fcee3
03 ffff9303`e8275a30 fffff807`76133590     nt!KiPageFault+0x358
04 ffff9303`e8275bc8 fffff807`761128f0     nt!RtlDecompressBufferXpressLz+0x50
05 ffff9303`e8275be0 fffff807`760e55c4     nt!RtlDecompressBufferEx+0x60
06 ffff9303`e8275c30 fffff807`760e5451     nt!ST_STORE<SM_TRAITS>::StDmSinglePageCopy+0x150
07 ffff9303`e8275cf0 fffff807`760e5d7c     nt!ST_STORE<SM_TRAITS>::StDmSinglePageTransfer+0xa5
08 ffff9303`e8275d40 fffff807`7611412c     nt!ST_STORE<SM_TRAITS>::StDmpSinglePageRetrieve+0x180
09 ffff9303`e8275de0 fffff807`76113f79     nt!ST_STORE<SM_TRAITS>::StDmPageRetrieve+0xc8
0a ffff9303`e8275e90 fffff807`76113e31     nt!SMKM_STORE<SM_TRAITS>::SmStDirectReadIssue+0x85
0b ffff9303`e8275f10 fffff807`76072228     nt!SMKM_STORE<SM_TRAITS>::SmStDirectReadCallout+0x21
0c ffff9303`e8275f40 fffff807`76115307     nt!KeExpandKernelStackAndCalloutInternal+0x78
0d ffff9303`e8275fb0 fffff807`760e0c7c     nt!SMKM_STORE<SM_TRAITS>::SmStDirectRead+0xc7
0e ffff9303`e8276080 fffff807`760e06b0     nt!SMKM_STORE<SM_TRAITS>::SmStWorkItemQueue+0x1ac
0f ffff9303`e82760d0 fffff807`76114567     nt!SMKM_STORE_MGR<SM_TRAITS>::SmIoCtxQueueWork+0xc0
10 ffff9303`e8276160 fffff807`76157d9f     nt!SMKM_STORE_MGR<SM_TRAITS>::SmPageRead+0x167
11 ffff9303`e82761d0 fffff807`76095900     nt!SmPageRead+0x33
12 ffff9303`e8276220 fffff807`76094b1d     nt!MiIssueHardFaultIo+0x10c
13 ffff9303`e8276270 fffff807`76030798     nt!MiIssueHardFault+0x29d
14 ffff9303`e8276330 fffff807`7620d1d8     nt!MmAccessFault+0x468
15 ffff9303`e82764d0 fffff807`73e84b8f     nt!KiPageFault+0x358
16 ffff9303`e8276660 fffff807`76a50d40     0xfffff807`73e84b8f
17 ffff9303`e8276668 00007ffb`c93e89f0     nt!MiSystemPartition
18 ffff9303`e8276670 00000000`00000000     0x00007ffb`c93e89f0

@EchterAgo

cbuf.txt

@EchterAgo

FFFFCD814BA6D080: osif_malloc:134: ExAllocatePoolWithTag failed (memusage: 12818907136)
FFFFCD814BA6D080: SPL: vmem_xalloc: vmem waiting for 32768 sized alloc for spl_default_arena, waiting threads 1, total threads waiting = 1
FFFFCD8106295040: dprintf: zfs_vnops_windows.c:7961:fastio_acquire_for_mod_write(): fastio_acquire_for_mod_write: 
FFFFCD8106295040: dprintf: zfs_vnops_windows.c:7992:fastio_acquire_for_mod_write(): fastio_acquire_for_mod_write: returning STATUS_CANT_WAIT
FFFFCD8106295040: dprintf: zfs_vnops_windows.c:8006:fastio_acquire_for_mod_write(): fastio_acquire_for_mod_write: returning STATUS_SUCCESS
FFFFCD8106295040: dprintf: zfs_vnops_windows.c:7961:fastio_acquire_for_mod_write(): fastio_acquire_for_mod_write: 
FFFFCD8106295040: dprintf: zfs_vnops_windows.c:7992:fastio_acquire_for_mod_write(): fastio_acquire_for_mod_write: returning STATUS_CANT_WAIT
FFFFCD8106295040: dprintf: zfs_vnops_windows.c:8006:fastio_acquire_for_mod_write(): fastio_acquire_for_mod_write: returning STATUS_SUCCESS
FFFFCD8106295040: dprintf: zfs_vnops_windows.c:7961:fastio_acquire_for_mod_write(): fastio_acquire_for_mod_write: 
FFFFCD8106295040: dprintf: zfs_vnops_windows.c:7992:fastio_acquire_for_mod_write(): fastio_acquire_for_mod_write: returning STATUS_CANT_WAIT
FFFFCD8106295040: dprintf: zfs_vnops_windows.c:8006:fastio_acquire_for_mod_write(): fastio_acquire_for_mod_write: returning STATUS_SUCCESS
FFFFCD8106295040: dprintf: zfs_vnops_windows.c:7961:fastio_acquire_for_mod_write(): fastio_acquire_for_mod_write: 
FFFFCD8106295040: dprintf: zfs_vnops_windows.c:7992:fastio_acquire_for_mod_write(): fastio_acquire_for_mod_write: returning STATUS_CANT_WAIT
FFFFCD8106295040: dprintf: zfs_vnops_windows.c:8006:fastio_acquire_for_mod_write(): fastio_acquire_for_mod_write: returning STATUS_SUCCESS
FFFFCD8106295040: dprintf: zfs_vnops_windows.c:7961:fastio_acquire_for_mod_write(): fastio_acquire_for_mod_write: 
FFFFCD8106295040: dprintf: zfs_vnops_windows.c:7992:fastio_acquire_for_mod_write(): fastio_acquire_for_mod_write: returning STATUS_CANT_WAIT
FFFFCD8106295040: dprintf: zfs_vnops_windows.c:8006:fastio_acquire_for_mod_write(): fastio_acquire_for_mod_write: returning STATUS_SUCCESS
FFFFCD8106295040: dprintf: zfs_vnops_windows.c:7961:fastio_acquire_for_mod_write(): fastio_acquire_for_mod_write: 
FFFFCD8106295040: osif_malloc:134: ExAllocatePoolWithTag failed (memusage: 12818907136)
FFFFCD8106295040: SPL: vmem_xalloc: vmem waiting for 32768 sized alloc for spl_default_arena, waiting threads 2, total threads waiting = 2
FFFFCD8106295040: SPL: vmem_xalloc: pressure 65536 targeted, 65536 delivered
FFFFCD814BA6D080: SPL: vmem_xalloc: pressure 32768 targeted, 65536 delivered
FFFFCD81066E3040: dprintf: arc_os.c:472:arc_reclaim_thread(): ZFS: arc growtime expired
FFFFD833E0978040: spl_event_thread: LOWMEMORY EVENT *** 0x0 (memusage: 12818907136)
FFFFD833E0978040: spl_event_thread: LOWMEMORY EVENT *** 0x0 (memusage: 12818907136)
FFFFD833E0978040: spl_event_thread: LOWMEMORY EVENT *** 0x0 (memusage: 12818907136)
FFFFD833E0978040: spl_event_thread: LOWMEMORY EVENT *** 0x0 (memusage: 12818907136)
FFFFD833E0978040: spl_event_thread: LOWMEMORY EVENT *** 0x0 (memusage: 12818907136)
FFFFD833E0978040: spl_event_thread: LOWMEMORY EVENT *** 0x0 (memusage: 12818907136)
FFFFD833E0978040: spl_event_thread: LOWMEMORY EVENT *** 0x0 (memusage: 12818907136)
FFFFD833E0978040: spl_event_thread: LOWMEMORY EVENT *** 0x0 (memusage: 12818907136)
FFFFD833E0978040: spl_event_thread: LOWMEMORY EVENT *** 0x0 (memusage: 12818907136)
FFFFD833E0978040: spl_event_thread: LOWMEMORY EVENT *** 0x0 (memusage: 12818907136)
FFFFD833E0978040: spl_event_thread: LOWMEMORY EVENT *** 0x0 (memusage: 12818907136)
FFFFD833E0978040: spl_event_thread: LOWMEMORY EVENT *** 0x0 (memusage: 12818907136)
FFFFD833E0978040: spl_event_thread: LOWMEMORY EVENT *** 0x0 (memusage: 12818907136)
FFFFD833E0978040: spl_event_thread: LOWMEMORY EVENT *** 0x0 (memusage: 12818907136)
FFFFD833E0978040: spl_event_thread: LOWMEMORY EVENT *** 0x0 (memusage: 12818907136)
FFFFD833E0978040: spl_event_thread: LOWMEMORY EVENT *** 0x0 (memusage: 12818907136)
FFFFCD81066E3040: dprintf: arc_os.c:472:arc_reclaim_thread(): ZFS: arc growtime expired
FFFFD833E0978040: spl_event_thread: LOWMEMORY EVENT *** 0x0 (memusage: 12818907136)
FFFFD833E0978040: spl_event_thread: LOWMEMORY EVENT *** 0x0 (memusage: 12818907136)
FFFFD833E0978040: spl_event_thread: LOWMEMORY EVENT *** 0x0 (memusage: 12818907136)
FFFFD833E0978040: spl_event_thread: LOWMEMORY EVENT *** 0x0 (memusage: 12818907136)
FFFFD833E0978040: spl_event_thread: LOWMEMORY EVENT *** 0x0 (memusage: 12818907136)
FFFFD833E0978040: spl_event_thread: LOWMEMORY EVENT *** 0x0 (memusage: 12818907136)
FFFFD833E0978040: spl_event_thread: LOWMEMORY EVENT *** 0x0 (memusage: 12818907136)
FFFFD833E0978040: spl_event_thread: LOWMEMORY EVENT *** 0x0 (memusage: 12818907136)
FFFFD833E0978040: spl_event_thread: LOWMEMORY EVENT *** 0x0 (memusage: 12818907136)
FFFFD833E0978040: spl_event_thread: LOWMEMORY EVENT *** 0x0 (memusage: 12818907136)
FFFFD833E0978040: spl_event_thread: LOWMEMORY EVENT *** 0x0 (memusage: 12818907136)
FFFFD833E0978040: spl_event_thread: LOWMEMORY EVENT *** 0x0 (memusage: 12818907136)
FFFFD833E0978040: spl_event_thread: LOWMEMORY EVENT *** 0x0 (memusage: 12818907136)

@lundman

lundman commented Oct 15, 2023

OK, the low-memory events are firing, so that is something. We could trigger another one when an alloc returns NULL.

Potentially, though, it does seem like a leak, since we can't reap memory that is leaking.

We actually dump leaked memory on module unload. That could be worth checking as well.
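
For reference, the system condition those spl_event_thread lines are reporting on is exposed through a documented named kernel event; here is a hedged sketch of that pattern. Only the event name is the documented one; the thread shape, stop flag, and reaction are illustrative stand-ins, not the real spl_event_thread:

/*
 * Sketch: wait on the system low-memory notification event.
 * \KernelObjects\LowMemoryCondition is documented; everything else
 * here is an illustrative placeholder.
 */
#include <ntddk.h>

static volatile BOOLEAN watcher_should_stop;	/* hypothetical stop flag */

static VOID
low_memory_watcher(PVOID context)
{
	UNICODE_STRING name;
	OBJECT_ATTRIBUTES attr;
	HANDLE handle;
	PKEVENT event;

	UNREFERENCED_PARAMETER(context);

	RtlInitUnicodeString(&name, L"\\KernelObjects\\LowMemoryCondition");
	InitializeObjectAttributes(&attr, &name, OBJ_KERNEL_HANDLE, NULL, NULL);

	if (!NT_SUCCESS(ZwOpenEvent(&handle,
	    EVENT_QUERY_STATE | SYNCHRONIZE, &attr)))
		return;

	if (NT_SUCCESS(ObReferenceObjectByHandle(handle, SYNCHRONIZE,
	    *ExEventObjectType, KernelMode, (PVOID *)&event, NULL))) {
		while (!watcher_should_stop) {
			/* Signaled while the OS wants memory back. */
			KeWaitForSingleObject(event, Executive, KernelMode,
			    FALSE, NULL);
			/* React here: kick the reaper, shrink caches. */
		}
		ObDereferenceObject(event);
	}
	ZwClose(handle);
}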

@EchterAgo

I think I should test with the last releases, and once I have a known-good commit I'll bisect to find where this started occurring. How can I get the module to unload?

@andrewc12

I believe there is some information about how to unload the driver in #287.

@lundman

lundman commented Nov 4, 2023

Yeah, it looks like it was doing the right thing... then something goes quirky and it stops staying under our imposed limit.

@lundman

lundman commented Nov 4, 2023

OK, we also need to remove KMC_ARENA_SLAB from the two abd cache creates in abd_os.c.

@EchterAgo

I think I haven't actually installed the new driver 🤦🏻

I'll have to pay better attention to this. I figured out why zfs_gitrev.h is not updated on each commit; I will make a PR for this soon.

@EchterAgo

Basically, you need to call get_git_head_revision so that grabRef.cmake is created. Then, when the commit changes, CMake will reconfigure.

@lundman

lundman commented Nov 4, 2023

I usually just "save" one of the CMakeLists.txt files, and then it regenerates.

@EchterAgo

EchterAgo commented Nov 4, 2023

fc741be seems to seriously decrease my write performance, from 200MB/s to 10-20MB/s. I tried hardcoding blksz to 4096 and got the expected speeds again.

@EchterAgo

With the blksz=4096 adjustment it has been going at ~200MB/s for 2.6TB already; I believe this is fixed, so I am closing. @lundman, should I create another issue for the performance problem? I'll also give this another test; maybe something else was wrong.

@lundman

lundman commented Nov 4, 2023

The =512 was a hack; I'll change it to pull out whatever recordsize is set.

@EchterAgo

Oh, it should be set to the recordsize? I use 1MB for that.

@lundman

lundman commented Nov 4, 2023

1ecc0ca
OK, that appears to run acceptably here.

@EchterAgo

EchterAgo commented Nov 4, 2023

For some reason zfs_blksz() also returns 512. The filesystem is created with ashift=12 and the underlying block size is 4k. The 512 does not come from the fallback return (512ULL); in zfs_blksz, but from sa_object_size.

@lundman

lundman commented Nov 4, 2023

Huh, that is surprising. We need to try to fish out the recordsize then.

@EchterAgo

EchterAgo commented Nov 4, 2023

zp->z_zfsvfs->z_max_blksz in zfs_blksz does contain the 1MB that is set as the recordsize. I'm not sure whether that is the actual recordsize, though.

@lundman

lundman commented Nov 4, 2023

os->os_recordsize should have it, but let's move zp->z_zfsvfs->z_max_blksz above the sa_object_size call then?
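
A sketch of what that reorder might look like; the body of zfs_blksz() is assumed here from the details quoted in this thread (sa_object_size, the 512ULL fallback), not copied from the real source:

/*
 * Sketch of the suggested reorder: consult z_max_blksz (which tracks
 * recordsize) before sa_object_size(), which was observed to return
 * 512 above. Shape assumed from this thread, not the actual code.
 */
static uint64_t
zfs_blksz(znode_t *zp)
{
	uint32_t blksize = 0;
	u_longlong_t nblks;

	/* Moved up: prefer the filesystem's maximum block size. */
	if (zp->z_zfsvfs->z_max_blksz != 0)
		return (zp->z_zfsvfs->z_max_blksz);

	if (zp->z_sa_hdl != NULL) {
		sa_object_size(zp->z_sa_hdl, &blksize, &nblks);
		if (blksize != 0)
			return ((uint64_t)blksize);	/* observed: 512 */
	}
	return (512ULL);	/* fallback mentioned above */
}

As the following comments show, this does make zfs_blksz() return the recordsize, but throughput actually got worse, so it is not a fix by itself.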

@EchterAgo

I used zpool create -o ashift=12 -O atime=off -O compression=lz4 -O recordsize=1MB -O casesensitivity=insensitive -O normalization=formD Data raidz1 PHYSICALDRIVE0 PHYSICALDRIVE1 PHYSICALDRIVE2 PHYSICALDRIVE3 cache PHYSICALDRIVE4 PHYSICALDRIVE5 to create this pool.

@EchterAgo

> os->os_recordsize should have it, but let's move zp->z_zfsvfs->z_max_blksz above the sa_object_size call then?

I'll try that.

@EchterAgo

EchterAgo commented Nov 4, 2023

Moving zp->z_zfsvfs->z_max_blksz up does make zfs_blksz return my recordsize. I confirmed it by increasing it to 2MB and checking the returned value. However, things are even slower than when 512 was returned: 5MB/s.

@EchterAgo

EchterAgo commented Nov 5, 2023

It also now uses 50% of the 12 CPU cores that I gave it, while only transferring 5MB/s:

[screenshot: CPU usage]

@EchterAgo

I also tried setting blksz directly to zp->z_zfsvfs->z_max_blksz instead of modifying zfs_blksz, but the outcome is the same: slow transfers.

What is interesting is that transfers start out fast for a few seconds, but then slow down to a crawl and remain that way.

@lundman

lundman commented Nov 5, 2023

So perhaps it should hold the value of ashift in this case.

@EchterAgo

> So perhaps it should hold the value of ashift in this case.

I do think so too. I'll try some other values > 4096; I want to know if 8192 slows down too.

@EchterAgo

8192 seems to be the same speed to me, but 16384 is already a bit slower.

@lundman

lundman commented Nov 5, 2023

OK, so we need to figure out how to get the ashift value.

@lundman

lundman commented Nov 5, 2023

Well, that's not easy then, hmm...

@lundman

lundman commented Nov 5, 2023

There is vdev_get_min_asize(zfsvfs->z_os->os_spa->spa_root_vdev), but that is all kinds of gross. I'll ask if anyone happens to know what the deal is.
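
One less gross possibility, purely as an assumption to weigh, is the aggregated ashift the spa already tracks (spa_max_ashift), which avoids reaching into spa_root_vdev:

/*
 * Hedged sketch: derive a block size from the pool's aggregated
 * ashift rather than walking the vdev tree. Whether this is the
 * right value for the write path is exactly the open question here.
 */
static uint64_t
zfs_ashift_blksz(zfsvfs_t *zfsvfs)
{
	spa_t *spa = dmu_objset_spa(zfsvfs->z_os);

	/* ashift=12 -> 1 << 12 = 4096-byte blocks */
	return (1ULL << spa->spa_max_ashift);
}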

@lundman

lundman commented Nov 5, 2023

I'm having issues convincing upstream that 512 vs 4096 makes any IO speed difference. Do you have speed graphs for both?

@EchterAgo

I just logged zpool iostat 5 for an 8-threaded rclone copy to a newly created pool, and it seems like there is some write amplification going on when blksz is not set to 4096.

iostat_512.txt
iostat_4096.txt

Note that in the 512-byte case iostat reports ~80-100MB/s of IO with a high write-operation count, but rclone itself only reports 5-10MB/s. I also noticed that it takes a long time to even get the rclone process to abort the copy.

In the 4096 case iostat reports ~160-240MB/s and a much lower write-operation count, which is about what I would expect and what I saw before 1ecc0ca. rclone itself also reports a speed range very similar to iostat during the copy. I can also abort rclone within seconds instead of minutes.

@EchterAgo

Created #318 to continue discussion there.
