Possible memory leak "ExAllocatePoolWithTag failed" in spl-seg_kmem.c, line 134
#283
Comments
I also just noticed from the time in the screenshot that this happened shortly after I started the copy. |
With kstat, you can just run it after some use, say 2-3TB of data. Since nothing should "grow forever" inside kmem, leaks tend to stand out quite a bit. If it's not an internal leak, it could be that we aren't releasing something between us and Windows, like when closing files or whatnot |
From the screenshot it looks like it takes about 2-3 minutes for the memory usage to go up significantly enough to cause out of memory, so I'd have to watch it all the time to catch it. I made a script now that logs kstat every 10 seconds so I can see what the last file is. |
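A rough sketch of what such a periodic kstat logger could look like, written as a small user-space C loop. The `kstat.exe` name, its presence on PATH, and the output file naming are assumptions, not taken from the actual script used here:

```c
/* Periodically snapshot kstat output into timestamped log files.
 * Assumes kstat.exe is on PATH; adjust the command as needed. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <windows.h>

int main(void)
{
    char cmd[256];

    for (;;) {
        time_t now = time(NULL);
        struct tm tmv;
        localtime_s(&tmv, &now);

        /* e.g. kstat_20240101_195013.log */
        snprintf(cmd, sizeof (cmd),
            "kstat.exe > kstat_%04d%02d%02d_%02d%02d%02d.log",
            tmv.tm_year + 1900, tmv.tm_mon + 1, tmv.tm_mday,
            tmv.tm_hour, tmv.tm_min, tmv.tm_sec);
        system(cmd);

        Sleep(10 * 1000);   /* 10 seconds between snapshots */
    }
    return 0;
}
```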
I ran CrystalDisk.exe, and then |
From what I could find, CrystalDiskMark is a wrapper around DiskSpd: https://github.com/ayavilevich/DiskSpdAuto You might want to look into this if you're planning on making some sort of test. |
I am not, just checking if I could make enough IO to show any leaks :) |
Of course when I try to reproduce the issue it doesn't happen :\ It ran all night, although a bit slow. |
OK, so this: that we occasionally get NULL isn't indicative of a problem on its own, unless it happens a lot, or quickly. If it keeps happening then we probably do have a leak, if reaping never releases enough memory. The Windows memory pressure work is here: You can also issue pressure events with kstat; I've not tried it on Windows, but the code hasn't changed. |
cbuf.txt looks interesting |
What makes me think this is an issue is that this is definitely new behaviour. I never encountered this before, despite doing a lot of copy runs; now I encounter it almost every time. |
There are a bunch of stacks like this: |
Above comment was about the ASSERT itself, which I should not have left in there :) Yeah, it is not doing well there. kmem seems quite wedged, which is shown in the stack. cbuf is most peculiar. More aggressive reaction to memory pressure would be nice I think. |
Can I get any more info from this state? It is very close to dying; I saw the memory usage go up very close to the maximum. I think it will be difficult to catch it there again without some automation. In any case I captured a memory dump. |
I let it run for a few more seconds: memory usage just keeps rapidly increasing, and rclone is completely stalled; I can't even terminate the process. I think I'll kill this now and get the kstat logs. I'll try to make a tool I can use to automate breaking if free memory gets low. Edit: I also checked cbuf again, just more of the same. |
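A rough sketch of that low-memory watchdog idea, assuming polling available physical memory from user mode is good enough. `DebugBreak()` only traps into a user-mode debugger attached to this process; breaking into the kernel debugger would need a different trigger, so treat this as a starting point rather than a finished tool:

```c
/* Poll available physical memory and break (or alert) when it drops
 * below a threshold, so the interesting moment isn't missed. */
#include <stdio.h>
#include <windows.h>

int main(void)
{
    const DWORDLONG threshold = 512ULL * 1024 * 1024;   /* 512 MB free */

    for (;;) {
        MEMORYSTATUSEX ms;
        ms.dwLength = sizeof (ms);

        if (GlobalMemoryStatusEx(&ms) && ms.ullAvailPhys < threshold) {
            printf("available physical memory low: %llu bytes\n",
                (unsigned long long)ms.ullAvailPhys);
            DebugBreak();   /* or: dump kstat, log, beep, ... */
        }
        Sleep(1000);        /* check once per second */
    }
    return 0;
}
```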
The last kstat output: output_195013.log. kstat logs taken every 10 seconds: kstat_logs.zip. Edit: Next time I need to do this I should log these to a network drive. |
So the biggest magazines are:
Most of those are as expected, but |
that bucket is also in cbuf a lot: |
What I don't understand is that normally memory usage is very stable: the VM has 16GB of memory, and normally there is around 4GB available during the test. But then at some point memory usage rises above that, and at that point I already know it will crash. |
I tried with the recent change to remove the assert; now it just sits there for a while, seemingly unable to recover, and then the debugger breaks like this: |
OK, the low memory events are firing, so that is something. We could trigger another when the alloc gets NULL. Potentially, though, it does seem like a leak, since we can't reap if they are leaking. We actually dump out leaked memory on module unload; that could be worth checking as well. |
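The "trigger another pressure event when the alloc gets NULL" idea might look roughly like this. It is only a sketch: `osif_malloc_sketch()` approximates the allocator in spl-seg_kmem.c, and `spl_signal_low_memory()` plus the pool tag are placeholders, not names from the actual source:

```c
#include <ntddk.h>

/* Placeholder for whatever mechanism already fires the existing
 * low-memory event in the SPL. */
extern void spl_signal_low_memory(void);

/* On a NULL return from ExAllocatePoolWithTag, signal memory pressure
 * so the caches get reaped, instead of only logging the failure. */
static void *
osif_malloc_sketch(SIZE_T size)
{
    void *buf = ExAllocatePoolWithTag(NonPagedPoolNx, size, 'SLPS');

    if (buf == NULL)
        spl_signal_low_memory();

    return (buf);
}
```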
I think I should test with the last releases and when I have a good commit I'll do a bisect to find where this started occurring. How can I get the module to unload? |
I believe this has some information about how to unload the driver |
yeah looks like it was doing the right thing... then something goes quirky and it forgets to stay under our imposed limit |
OK also need to remove |
I think I haven't actually installed the new driver 🤦🏻 I'll have to pay attention to this better. I figured out why |
Basically you need to call |
I usually just "save" one of the cmakelist.txt files, then it regenerates |
fc741be seems to seriously decrease my write performance, from 200MB/s to 10-20MB/s. I tried hardcoding |
With the |
=512 was a hack; I'll change it to pull out whatever recordsize is set. |
Oh, it should be set to the recordsize? I use 1MB for that. |
1ecc0ca |
For some reason |
Huh, that is surprising. We need to try to fish out the recordsize then |
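One way the kernel side could "fish out" the recordsize, assuming the usual OpenZFS in-kernel structures (`znode_t::z_blksz`, `zfsvfs_t::z_max_blksz`) are available in the Windows port; the actual hook point in the driver may differ:

```c
/* Prefer the file's current block size, fall back to the dataset's
 * maximum block size (which tracks recordsize), and never report
 * anything smaller than a 512-byte sector. */
static uint64_t
zfs_io_block_size(znode_t *zp)
{
    zfsvfs_t *zfsvfs = zp->z_zfsvfs;
    uint64_t blksz = zp->z_blksz;

    if (blksz == 0)
        blksz = zfsvfs->z_max_blksz;

    return (MAX(blksz, SPA_MINBLOCKSIZE));
}
```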
I used |
I'll try that. |
Moving |
I also tried setting What is interesting is that transfers start out fast for a few seconds but then slow down to a crawl and remain that way. |
so perhaps it should hold the value of ashift in this case |
I do think so too. I'll try some other values > 4096; I want to know if 8192 slows down too. |
8192 seems to be the same speed to me, but at 16384 it is already a bit slower. |
OK, so we need to figure out how to get the ashift value |
well, that's not easy then, hmm.. |
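For the ashift side, a rough sketch of where the value could come from, assuming the pool-wide `spa_max_ashift` field is usable here; whether the maximum or minimum ashift is the right sector size to report is exactly the open question above:

```c
/* Derive a sector size from the pool's ashift instead of hardcoding
 * 512 or 4096. */
static uint64_t
zfs_sector_size(spa_t *spa)
{
    uint64_t ashift = spa->spa_max_ashift;  /* 9 -> 512, 12 -> 4096 */

    if (ashift == 0)
        ashift = SPA_MINBLOCKSHIFT;         /* fall back to 512 bytes */

    return (1ULL << ashift);
}
```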
I'm having issues convincing upstream that 512 vs 4096 makes any IO speed difference. Do you have speed graphs for both? |
I just logged iostat_512.txt. Note that in the 512-byte case iostat reports ~80-100MB/s of IO with a high write operation count, but rclone itself only reports 5-10MB/s. What I also noticed is that it takes a long time to even get the rclone process to abort the copy. In the 4096 case iostat reports ~160-240MB/s and a much lower write operation count, which is about what I would expect and had seen before 1ecc0ca. rclone itself also reports a speed range very similar to iostat during the copy, and I can abort rclone within seconds instead of minutes. |
Created #318 to continue discussion there. |
When testing #281 I noticed that when copying a 5TB dataset using rclone it always ends in an allocation failure:
so ExAllocatePoolWithTag failed
memory.txt
This seems like a new issue because I was still able to copy the full dataset not that long ago.
I'll try to get some kstat information. Is it possible to get kstat info from the debugger when the failure has already happened? I could also try logging it periodically to a file.