-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement thread suspension on Linux and Mac platform #3947
Comments
@sergiy-k There is already code for suspending and resuming Mac threads in the pal, see: CThreadSuspensionInfo::THREADHandleSuspendNative(CPalThread *pthrTarget) Is there some reason this isn't sufficient? |
@kangaroo You are right – the code is there. But, honestly, I don’t know if it still works because this code has some age (I’m sure you have noticed it yourself that there is a lot stuff in PAL that needs to cleaned up) so I want to test it first. Luckily I got a Mac machine yesterday that I could you use for testing. :) That’s why I opened this work item. Besides, there are a few PAL tests for thread suspension that fail even on Linux today. |
The existing PAL thread suspension code never worked reliably. It was not used for Silverlight - Silverlight used slower thread suspension technique via explicit GC probes. This issue is about designing and implementing a scheme that works reliably on Linux and Mac - without the overhead of explicit GC probes. The current PAL code is a pretty complex emulation of SuspendThread/ResumeThread/GetThreadContext/SetThreadContext Win32 APIs. src\vm\threaduspend.cpp wraps these APIs with another complex layer to actually perform the managed thread suspension. One potential avenue to investigate is to design a custom PAL APIs to eliminate some of this complexity - that just does what src\vm\threaduspend.cpp needs instead of general purpose emulation of Win32. |
@jkotas I'd be curious what unreliability you guys had for the MAC PAL. I've done an (albeit quick) once over of the code, and it looks functionally correct for a non-cooperative suspend/resume. |
The current implementation of PAL thread suspension requires a lot of code to be aware of it. It is impossible to ensure and review that all of that code is right. IIRC, the thread suspension was a stress bug factory in Silverlight days and that is why we switched to explicit GC probes as backstop. I believe that the right design of the thread suspension has to have minimal dependencies, e.g. any bugs in thread suspension have to be in the thread suspension code itself - that should be relatively small and easy to review. The design cannot not depend on code all over the place to cooperate for correctness. Look for comments like: #if !DEADLOCK_WHEN_THREAD_IS_SUSPENDED_WHILE_BLOCKED_ON_MUTEX Wrapper for mkstemp. mkstemp may invoke malloc, in which case a thread // Prime the location of CoreCLR so that we don't deadlock under |
@jkotas That presents an interesting design challenge on mac and linux. Any thread suspended in user land could be holding any lock, so I see a few obvious options: 1> Make all code that relies on suspending someone else lock and allocator free. Basically treat it as if it were operating in a signal handler. I think 2 is much trickier, since we could be talking about the malloc lock, printf lock, etc. Thoughts? |
Agree. 1 is the right approach. |
Agree. As a matter of fact, some of the runtime is already trying to that. For example, in threadsuspend.cpp in VM: // We can not allocate memory after we suspend a thread. In addition, existing PAL code tries to detect and mitigate this problem by using EnterUnsafeRegion/LeaveUnsafeRegion, which I want to get rid of as well. My plan is to first ensure that the existing code for suspending/resuming threads and getting/setting thread context works (on both Linux and Mac) and fix currently failing tests or at least understand why they are failing (this was the real intention of this issue). After that I will look at how we can make thread suspension safe and open a separate issue for that. |
I have been thinking about what we could do to guarantee that we don’t end up in a deadlock situation when using the Suspend/ResumeThread APIs. It seems that the best way to guarantee that this will not happen is to get rid of these dangerous APIs all together. If we can do this then, in addition to making the runtime code safer, it will also allow us to remove A LOT of code and complexity from PAL. As far as I remember, the main usage of the Suspend/Resume thread APIs in SuspendRuntime is to hijack threads and to drive the threads to a safe point. As @jkotas pointed out, this functionality is currently disabled on non-Windows platforms (by defining DISABLE_THREADSUSPEND) and it relies on the GCPOLL mechanism instead. Even though GCPOLL works OK in many cases, it cannot deal with the code that has tight long running (or never ending) loops that do not have GC synchronization points. So we do want to enable the hijacking functionality on non-Windows platforms too and I believe we can do it without the Suspend/ResumeThread APIs. Basically, the idea is that instead of suspending a thread and then doing a bunch of work on the suspended context in the SuspendRuntime function, we can tell a thread to do hijacking its own. For example, on Linux SuspendRuntime can send a signal to a thread. When the thread receive this signal, it inspects and modifies its own context and reports the status back to SuspendRuntime/GC once it’s done. Thoughts? Your feedback is more than welcomed. :) |
Just wanted to confirm my understanding. The broad proposed approach is that threads could be suspended/hijacked holding any lock, but only during the period of time where the runtime has not yet finished suspending as a whole. At the point that the GC code would run (or the debugger helper), we would have resumed those threads and blocked them somewhere more controlled to ensure they are not holding various low-level locks that might be necessary for OS and C++ runtime APIs to run normally. I'm not a suspension expert but this seems like a fine direction to me. I think one of the detail areas should be debugger interaction given the various synchronous and asynchronous CONTEXT modification the debugger can do. I'm happy to take a closer look with you at some point. |
I've done some investigation and prototyping for this work on Linux, and here is what I have in mind so far (at a high level): In SuspendRuntime, we raise a signal to each thread that needs to leave cooperative mode in order for GC to proceed. In the handler for that signal, we will capture the context and call a function that will determine the type of code that was interrupted and react accordingly.
Thoughts? I think this plan should be pretty sound but it could change if there are any unforeseen issues discovered. |
You should carefully review all code executed in the signal handler to make sure it only calls methods that are signal handler safe. |
@jkotas Agreed. This will be reviewed and tested thoroughly before moving forward. |
I believe it is actually ok to call even non-safe functions provided we call them only when we know we have interrupted the code in managed code. In such case, it is ok to call any platform functions, since we are sure that we are not in the middle of another platform function that could collide with the one we are executing (taking the same lock, using the same I/O buffer, ...) |
Agree. Only the code that determines whether we are in managed code needs to be scrutinized. |
dotnet/coreclr#1371 addresses this issue for Linux. |
I have assigned this issue to myself, since I am starting to work on the thread suspension for OSX.
The interesting point is why we redirect the thread when we are at an interruptible point instead of just keeping the thread suspended. The answer is in one of the comments in the threadsuspend.cpp:
Now the good thing about OSX thread_suspend is that the context of the suspended thread is accurate and so we don't need the redirection. And not needing the redirection means that we don't need to modify the context of the suspended thread. All we need to do is to suspend and possibly modify the return address on stack to hijack. |
@kangaroo Could you please take a look at the OSX plan above and let me know if you have any concerns? |
@janvorli In general it sounds reasonable to me. One question. What makes a point interruptible or not in regards to managed code? Is it possible for a non-interruptible point to be in a tight loop, or waiting on a conditional which would cause the return to not execute? |
@kangaroo, here is the description: |
Thanks @janvorli -- it sounds like from the explanation and my looking at the code, the concern case I raised isn't possible. |
@kangaroo I have implemented the runtime suspension the planned way and it works. But I have realized that using the way we use on Linux (injecting activation) on OSX would enable us to get rid of the safe / unsafe regions tracking in PAL. That would be a huge benefit. So I am changing the plan and I am going to implement the PAL_InjectActivation for OSX instead of mimicking the windows way, |
@janvorli SGTM |
OSX support implemented by dotnet/coreclr#1610 |
CoreCLR needs to be able to suspend threads in a managed application in order to run GC. Currently CoreCLR PAL implements threads suspension using signals on Linux. We need to enable and test implementation of thread suspension for Mac. In addition, we need to implement functionality for getting and setting context of a given thread in the current process.
The text was updated successfully, but these errors were encountered: