-
Notifications
You must be signed in to change notification settings - Fork 2.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Memory Overwrite #11681
Comments
One thing to try is to create a debug build of mpv with the AddressSanitizer enabled and run with that build for a while. In issue #11626 I showed how I enabled the sanitizer when building using mpv-build: printf "%s\n" -Dbuildtype=debug > mpv_options
printf "%s\n" -Db_sanitize=address,undefined >> mpv_options If the problem reproduces the address sanitizer may be able to identify the offending code. |
@low-batt: Well! I got my first crash since compiling with the address sanitizer. But is this the same crash? What parts of the report are relevant? The address sanitizer, the crashed thread, or all threads from 0 up to the crashed thread?
|
On whether the root cause of this crash is the same defect that caused the previous crash… We need to track down and be able to explain the defective code to say for sure. We suspect a code defect is causing a memory overwrite. Memory overwrites can manifest in all kinds of different ways. On what parts of the crash report are relevant… In general this depends upon the kind of software problem. Some problems are obvious from just the stack traceback of the thread that crashed. For other problems it is desirable to have all of the available evidence. For example, looking over what else is going on with the other threads may show a pattern that provides a clue to the root cause. Back to the crash at hand… Did this crash also occur while mpv was quitting? This part is very interesting:
This is the thread that crashed:
The following is from the crash report in issue #10986:
The crash report in #10986 is suggestive of a problem with coordinating shutdown of the OpenGL context. Looking at the code for mpv_render_context_free shows a lot of code that is intended to properly coordinate freeing of the context. Possibly that code is not quite correct and there is a thread race condition lurking that is responsible for the problem at hand. This is just speculation. Remember I am not a mpv developer. Continue to run with sanity checking enabled. Lets see if we get another crash report that provides additional clues. |
New incident. Don't know if it's the same one. Happened when quitting at the end of a video.
|
Yeah it seems to be the same root cause. If this is indeed a bug with libmpv, I wonder if it can be reproduced on other platforms? |
Don't the reports pinpoint (or at least narrow down considerably) the culprit? It has happened at least three times the last couple of weeks, and I think it was always:
Though this may not be necessary factors for reproduction. PS! Is it possible in some way to replay a mpv session, like tool-assisted speedruns? (Though this may not trigger the same issue every time because of the arbitrary nature of memory allocation.) PPS! Then again, is there any way to sandbox an application, thereby forcing the exact same execution every time? Like a docker container, VirtualBox, etc. Maybe this is overkill, and still impossible to control, but interesting to know. |
Yes. I'm not an mpv dev though, so besides indicating a bug somewhere in I wonder if also compiling with https://clang.llvm.org/docs/ThreadSanitizer.html
This is what https://rr-project.org/ is supposed to do, but I don't think such tools work on mac due to reasons mentioned in https://joneschrisg.wordpress.com/2015/01/29/rr-on-os-x-can-it-be-ported/
Normal sandboxing is not sufficient, since if it's a race condition then the timing with which memory accesses occur has an impact on the state, and to my knowledge I don't believe there's a way to perfectly ensure same state (would love to know if there is such a tool though!) I think the best equivalent is using a race detector like TSan, since it "kind of" works around the memory access timing issue because it can instrument every memory access to do checking. |
Sorry I've been slient. Super busy these days. On this:
Yes and no. These two lines of the stack trace:
Show mpv gl_tex_destroy calling OpenGL glDeleteTextures. From OpenGL Context:
Under macOS "thread-local variable" corresponds to NSThread.threadDictionary. Using thread local storage avoids the need to add a So one concern is the state of the OpenGL context. In the crash reports so far we have seen the main thread crash as well as a thread from the CA DispatchGroup. I took @krackers advice and built a version of mpv with TSan enabled. That showed a problem in the logger at startup, which may explain a startup hang that I have experienced. I also ran into a reproducible crash in the playback core. Not sure what that is about. As the crashes are happening in OpenGL code, not mpv code, it limits the amount of help some of these santizers can provide. TSan playback core crash:
I tried to reproduce this problem using mpv built with the address sanitizer and have only managed to trigger it once. I was using variations on the follow script to test mpv shutdown. startQuitLoop.sh:#!/bin/bash
while true; do
echo Starting mpv
/Users/low-batt/Documents/builds/mpv-build/mpv-build/mpv/build/mpv --autofit=30% --config=no --input-ipc-server=/tmp/mpv-socket --mute ~/Movies/Airplaneski.mp4 &
sleep 2
echo Instructing mpv to Quit
echo '{ "command": ["quit"] }' | socat - /tmp/mpv-socket | jq
sleep 2
done That detected the same problem reported here: Address sanitizer abort:
I'm expecting this requires the attention of a mpv developer with knowledge of different subsystems and how shutdown is supposed to be coordinated. |
Thank you for chiming in low-batt, it is always a pleasure to read your posts!
Have all instances of the crash been with people using OSX on ARM machines? On those machines OpenGL commands go through a translation layer, and it is known that it is more strict about requiring serialization for context access (see glfw/glfw#1997 (comment)) I wonder if it is a similar issue here? |
Although this OpenGL requirement from OpenGL Programming Guide for Mac is very important:
The situation at hand is worse. This is during shutdown, so single threading context access is not enough. All systems must be coordinated to ensure no thread tries to use the context after it has been shutdown. So even if access is single threaded it would be like a thread trying to access a file after another thread closed the channel. The mpv code tries to coordinate this. But it could be some coordination is missing or something is not quite correct in the thread coordination. This feels like a thread race condition with a very small window of vulnerability. With that kind of problem the failure is usually, but not always, more likely to occur when running on hardware that is actually concurrently running threads on multiple CPU cores. That may explain why this is being encountered when running on Apple Silicon with its many cores. |
Made me think about this some more. I'm thinking your comment:
May have nailed it. The Apple documentation doesn't really spell this out, as in explain what can go wrong if you do your own locking. I'm thinking the one crash we have seen outside of the main thread may be an example. This stack trace:
Seems to indicate Quartz is performing some operations asynchronously. If the Core Graphics framework is asynchronously accessing the OpenGL context then a lock declared at the application level will be ineffective in single threading access to the context. That means The failure to use these locking methods for single threading access to the OpenGL context may be the root cause of the crashes reported in this issue. IINA experienced crashes in |
Yup, that makes perfect sense to me, and would fit with what's been observed! Nice debugging work! |
I was curious and took a look at disassembly of CGLLockContext:
So it is basically just a regular pthread mutex lock, I guess as you mentioned the key is that Apple's own libraries (e.g. Quartz) make use of this, so if you want to avoid race-conditions with Apple's own libraries trying to access the same context, then you must make use of |
Yes, exactly. The mpv code locked the side door and the asynchronous Quartz framework code walked in the front door. All the code needs to be using the same lock. At least this sure seems like the issue. Frustrating that it is not easy to find detailed documentation regarding this. I always like to know the reasons behind requirements and what can go wrong if they are not followed. I'm expecting there isn't a lot of discussion on the net about this because most coders use |
Had another one of these |
Yet another one |
Had another one of these on a different Mac/macOS:
|
this fixes a crash on quit, when a CATransaction from a system owned thread/event is happening at the same time. locking the context synchronises these access and prevents the race condition. the draw operation induced by any display call from the CAOpenGLLayer doesn't need that lock, since the display function already does lock that current context. Fixes mpv-player#11681
Great work, everyone. I've applied the patch and will notify if it happens again (though it's been very rare). |
I tried to verify the fix. I built I also ran hundreds of cycles with the fix. And tested without enabling the address sanitizer. No failures were encountered. The changes should fix this one, but I just wasn't able to verify it. |
yeah, did the same. i wrote a small shell script that launched mpv, quit it after a 1-2 seconds and let it run for nearly 2 hours. i didn't encounter a crash once sadly. |
this fixes a crash on quit, when a CATransaction from a system owned thread/event is happening at the same time. locking the context synchronises these access and prevents the race condition. the draw operation induced by any display call from the CAOpenGLLayer doesn't need that lock, since the display function already does lock that current context. Fixes mpv-player#11681
Important Information
Provide following Information:
Reproduction steps
First time incident. Happened when quitting movie (or possibly fast-forwarding towards end).
Since my code is patched with #11627, I can't see how it can be the same memory overwrite.
Also thread 1 contains references to
MRPlayer
etc. that I don't find in any other issues. (Or are threads other than the offendeng one arbitrary and irrelevant?)Expected behavior
No exception
Actual behavior
Log file
crash.log.gz
Sample files
N/A
The text was updated successfully, but these errors were encountered: