Unreal Crash Handler Investigation #674

Open
PlasmaDev5 opened this issue Nov 1, 2024 · 3 comments

@PlasmaDev5
Collaborator

Overview

Out of the box, Unreal provides its own crash reporting tools to handle incidents on supported platforms. After an initial investigation, it has been deemed worth a deeper investigation and an experimental implementation.

Locations of Interest

Unreal has aspects of the crash reporter distributed across multiple areas of the engine, but there are a few key locations of note.

Crash Report Client: https://github.com/EpicGames/UnrealEngine/tree/release/Engine/Source/Programs/CrashReportClient

Root for core-level abstractions (this includes the stack walker, crash context, and more): https://github.com/EpicGames/UnrealEngine/tree/release/Engine/Source/Runtime/Core/Public/GenericPlatform

Crash Report Core: https://github.com/EpicGames/UnrealEngine/tree/release/Engine/Source/Runtime/CrashReportCore

GPU Crash (release branch of EpicGames/UnrealEngine):
Engine/Source/Runtime/RenderCore/Private/GPUDebugCrashUtils.cpp
Engine/Source/Runtime/RenderCore/Private/DumpGPU.cpp

Information Tracked

| Variable | Comment |
| --- | --- |
| CrashGUID | A unique report name that this crash belongs to. Folder name. |
| GameName | The name of the game that crashed. (AppID) |
| ExecutableName | The name of the exe that crashed. (AppID) |
| EngineMode | The mode the game was in, e.g. editor. |
| DeploymentName | Deployment (also known as "EpicApp"), e.g. DevPlaytest, PublicTest, Live. |
| EngineModeEx | EngineModeEx, e.g. Unset, Dirty, Vanilla. |
| PlatformFullName | The platform that crashed, e.g. Win64. |
| EngineVersion | Encoded engine version (AppVersion), e.g. 4.3.0.0-2215663+UE-Releases+4.3. Format: BuildVersion-BuiltFromCL-BranchName. |
| CommandLine | The command line of the application that crashed. |
| BaseDir | The base directory where the app was running. |
| AppDefaultLocale | The language ID of the application that crashed. |
| UserName | The name of the user that caused this crash. |
| LoginId | The unique ID used to identify the machine the crash occurred on. |
| EpicAccountId | The Epic account ID for the user who last used the Launcher. |
| GameSessionID | The last game session ID set by the application. Application-specific meaning. Some might not set this. |
| PCallStackHash | A hash representing a unique ID for a portable callstack. These will be specific to the CL version of the application. |
| CrashSignal | The signal that was raised to enter the crash handler. |
| NumMinidumpFramesToIgnore | Specifies the number of stack frames in the callstack to ignore when symbolicating from a minidump. |
| CallStack | An array of FStrings representing the callstack of the crash. |
| SourceContext | An array of FStrings showing the source code around the crash. |
| Modules | An array of the names of the modules used by the game that crashed. |
| UserDescription | An array of FStrings representing the user description of the crash. |
| UserActivityHint | An FString representing the user activity, if known, when the error occurred. |
| ErrorMessage | The error message; can be an assertion message, an ensure message, or the message from the fatal error. |
| FullCrashDumpLocation | Location of the full crash dump. Displayed in the crash report frontend. |
| TimeOfCrash | The UTC time the crash occurred. |
| bAllowToBeContacted | Whether the user agreed to be contacted. If true, the following properties are retrieved from the system: UserName (for non-launcher builds) and EpicAccountID. |
| CrashReporterMessage | Rich text string (should be localized by the crashing application) that will be displayed in the main CRC dialog. Can be empty, in which case the CRC's default text is shown. |
| PlatformCallbackResult | Platform-specific UE Core value (integer). |
| CrashReportClientVersion | CRC sets this to the current version of the software. |
| bHasMiniDumpFile | Whether this crash has a minidump file. |
| bHasLogFile | Whether this crash has a log file. |
| bHasPrimaryData | Whether this crash contains primary usable data. |
| RestartCommandLine | Copy of CommandLine that isn't anonymized, so it can be used to restart the process. |
| bIsEnsure | Whether the report comes from a non-fatal event such as an ensure. |
| CrashType | The type of crash being reported, e.g. Assert, Ensure, Hang. |
| CPUBrand | The CPU brand of the device, e.g. Intel, iPhone6, etc. |
| Threads | Thread contexts: XML elements containing info specific to an active thread, e.g. callstacks. |
| PlatformPropertiesExtras | Optional additional data for platform properties. |
| bIsOOM | Whether it was an OOM (Out Of Memory) or not. |
| bLowMemoryWarning | Whether we got a low memory warning or not. |
| bInBackground | Whether we were in the background when the crash happened. |
| bIsRequestingExit | Whether we crashed during shutdown. |
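
As a small illustration of consuming one of these fields, the encoded EngineVersion string could be split into its parts roughly as follows. This is a sketch only: `EncodedEngineVersion` and `ParseEngineVersion` are hypothetical names, and the separator rule (first `-` ends the build version, first `+` after it ends the changelist) is an assumption read off the single example above, not something confirmed against the engine source.

```cpp
#include <cstddef>
#include <string>

// Hypothetical holder for the three parts of an encoded EngineVersion string.
struct EncodedEngineVersion {
    std::string BuildVersion;   // e.g. "4.3.0.0"
    long long BuiltFromCL = 0;  // e.g. 2215663
    std::string BranchName;     // e.g. "UE-Releases+4.3"
};

// Assumption: the first '-' ends the build version and the first '+' after it
// ends the changelist, as in "4.3.0.0-2215663+UE-Releases+4.3".
// Returns false (leaving Out untouched) if the string does not fit that shape.
bool ParseEngineVersion(const std::string& Encoded, EncodedEngineVersion& Out) {
    const std::size_t Dash = Encoded.find('-');
    if (Dash == std::string::npos || Dash == 0) return false;

    const std::size_t Plus = Encoded.find('+', Dash + 1);
    if (Plus == std::string::npos || Plus == Dash + 1) return false;

    const std::string Changelist = Encoded.substr(Dash + 1, Plus - Dash - 1);
    // Reject changelists that are not purely decimal digits or are too long to fit.
    if (Changelist.size() > 18 ||
        Changelist.find_first_not_of("0123456789") != std::string::npos) {
        return false;
    }

    Out.BuildVersion = Encoded.substr(0, Dash);
    Out.BuiltFromCL = std::stoll(Changelist);
    Out.BranchName = Encoded.substr(Plus + 1);
    return true;
}
```

The branch name is kept verbatim, including any further `+` or `-` characters, since the table does not say how those should be decoded.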

Implementation Challenges

After reaching out to get more information on how native handles backends, it came to my attention that we will run into issues supporting this crash handler. With the current setup, the selected backend must exist at compile time. That would not be possible for Unreal, because Unreal would have to be made a third-party dependency of native.

Reasons to Support Unreal Crash

There are a handful of reasons we may want to move forward with the Unreal Crash Handler. The biggest reason is that we gain access to GPU-related crash data such as hangs and device-loss errors. In game development this is a very important aspect of error coverage. This was recently requested in "GPU crashes not captured by Sentry plugin" (issue #673).
In terms of the wider dump information, it doesn't look like there is too much information here that we don't already track, but there is the potential for more detailed engine information that we do not normally track such as the execution command line and the crash report message from the engine potentially pointing to the problem.
I believe Unreal's crash handler supports all target platforms, at least in some form (I am currently not able to validate closed platforms). This could expedite porting projects and help ensure compatibility with all engine export targets.
Additionally, we gain much better legacy support for the engine, as the crash handler very rarely receives API-breaking changes. This would mean that we can support much older engine versions with, at worst, some minor version defines.
Finally, this would resolve ongoing issues with the current crash handler, which has been known to break compatibility through its updates. Fixing these recurring update issues has required significant resources that could be better applied elsewhere.
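
The "minor version defines" mentioned above could look roughly like the sketch below. It assumes the engine headers provide `ENGINE_MAJOR_VERSION` and `ENGINE_MINOR_VERSION`; the fallback values, the `SKETCH_ENGINE_VERSION_AT_LEAST` macro, and the `CrashHookVariant` function are hypothetical and exist only so the example compiles on its own.

```cpp
#include <string>

// Assumption: inside an Unreal module these macros come from the engine's
// version header. The fallback values below are placeholders for this sketch.
#ifndef ENGINE_MAJOR_VERSION
#define ENGINE_MAJOR_VERSION 5
#define ENGINE_MINOR_VERSION 4
#endif

// Hypothetical helper: true when the engine is at least Major.Minor.
#define SKETCH_ENGINE_VERSION_AT_LEAST(Major, Minor) \
    (ENGINE_MAJOR_VERSION > (Major) ||               \
     (ENGINE_MAJOR_VERSION == (Major) && ENGINE_MINOR_VERSION >= (Minor)))

// Picks which (hypothetical) crash-hook code path to compile for this engine.
std::string CrashHookVariant() {
#if SKETCH_ENGINE_VERSION_AT_LEAST(5, 0)
    return "ue5";
#else
    return "ue4";
#endif
}
```

The point of the guard is that only the few call sites whose signatures changed between engine versions need to sit behind it; the rest of the integration stays shared.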

What do we lose

One thing to note is that we would likely have to do everything in-process, due to a mix of limitations. First, I believe we are limited on the Marketplace side, where we are not able to distribute executable programs. Additionally, I believe plugins are not capable of building executables, so we would have to run an engine fork instead of a plugin, similar to how NVIDIA operates its forks.
From what I can see, the current Crashpad solution captures more raw system and crash information that is not directly related to Unreal itself. This level of information could still be useful for developers.

Solutions

Option A

One potential option, if we intend to move forward, is to open up native to allow for runtime backends. By doing this we would be able to add Unreal Crash Core as a backend into native with minimal changes to the current Unreal plugin. I also believe this could benefit native by allowing other developers to inject their custom crash handlers as required.
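
To make the idea of a runtime backend more concrete, here is a minimal sketch of a registry where backends are supplied as callbacks at runtime instead of being selected when the SDK is compiled. Everything in it (`CrashBackend`, `BackendRegistry`, the method names) is hypothetical and is not how sentry-native is structured today.

```cpp
#include <functional>
#include <map>
#include <string>

// Hypothetical backend description: callbacks provided by the host at runtime.
struct CrashBackend {
    std::function<bool()> Startup;   // install crash handlers; false on failure
    std::function<void()> Shutdown;  // remove crash handlers (optional)
};

class BackendRegistry {
public:
    // Registers a backend under a name. Fails if the name is already taken
    // or the backend has no Startup callback.
    bool Register(const std::string& Name, const CrashBackend& Backend) {
        if (!Backend.Startup || Backends.count(Name) != 0) return false;
        Backends[Name] = Backend;
        return true;
    }

    // Shuts down the currently active backend (if any), then starts the named one.
    bool Activate(const std::string& Name) {
        std::map<std::string, CrashBackend>::iterator It = Backends.find(Name);
        if (It == Backends.end()) return false;

        if (!Active.empty()) {
            CrashBackend& Previous = Backends[Active];
            if (Previous.Shutdown) Previous.Shutdown();
            Active.clear();
        }
        if (!It->second.Startup()) return false;
        Active = Name;
        return true;
    }

    const std::string& ActiveName() const { return Active; }

private:
    std::map<std::string, CrashBackend> Backends;
    std::string Active;
};
```

Under this shape, the Unreal plugin would register a backend whose callbacks call into the engine's crash handling, so native itself would not need Unreal as a compile-time dependency.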

Option B

Another potential solution is compiling a custom sentry-native as part of the plugin. This version would have Unreal as a dependency and implement the crash handler in question. Currently this is my least desired approach, as it would likely cause maintenance issues in ensuring that this native version stays in step with the main version. It would also introduce build complexity by adding a custom third-party library to compile as part of the build process.

Option C

An option that was suggested whilst gathering info on native is potentially dropping native completely and re-implementing its behavior via Unreal. This means we would re-implement much of the higher-level Sentry API whilst using Unreal's packet and messaging system to send the data to Sentry. On paper this provides a few gains, such as a simplified build system, since we would no longer need to pull native from CI, but it also comes with tradeoffs, such as the scale of work required to implement and maintain this new API.

PlasmaDev5 added the Documentation label (Improvements or additions to documentation) on Nov 1, 2024
PlasmaDev5 self-assigned this on Nov 1, 2024
@bruno-garcia
Member

We need the list of info we get from crashpad, so we can understand what we're missing.

Notes on in-proc vs out-of-proc

Out-of-process and in-process crash handlers each have distinct advantages depending on the needs of the application and the level of resilience desired. Here’s a comparison of their main benefits:

Out-of-Process Crash Handler
Higher Resilience: An out-of-process crash handler runs in a separate process, so if the main application crashes or has severe memory corruption, the crash handler remains unaffected. This makes it highly reliable for capturing crash details, especially in cases of severe crashes (e.g., stack overflow or heap corruption).

Detailed Crash Reporting: Since it’s unaffected by the crashing process’s state, it can often gather more detailed diagnostics, including the memory dump, stack trace, and error codes, without being impacted by the corrupted state.

Isolation from Application Failures: Because it operates separately, an out-of-process handler isn’t at risk of crashing alongside the application. This isolation helps ensure a higher success rate in recording crash data, particularly in scenarios with critical memory issues or extensive runtime faults.

Lower Performance Overhead: Out-of-process handlers tend to have lower performance impact on the main application since they don’t continuously operate in the same memory space or on the same thread, which can help maintain performance under normal operations.

Security and Stability: Running out of process limits the scope of what can be accessed or affected by the crashing process. This design is more secure and less likely to interfere with the crash handler’s stability.

In-Process Crash Handler
Faster Crash Handling: Since the handler runs within the same process, it can access data directly without any inter-process communication. This can make crash reporting faster, with lower latency in collecting initial crash data, which is beneficial for real-time or latency-sensitive applications.

More Contextual Information: The in-process handler can access specific runtime details that may be lost in an out-of-process handler. For example, it can capture local variables and more immediate context about the function calls leading up to the crash.

Simpler Implementation: In-process crash handlers can be easier to implement because they don’t require inter-process communication or separate handling mechanisms. This can be beneficial for lightweight applications or scenarios where crash handling is not mission-critical.

Direct Memory Access: With direct access to the same memory space, in-process handlers can gather more granular details from the application state, especially if the crash isn’t catastrophic (e.g., less severe memory violations).

Choosing Between the Two
Out-of-process is generally preferred in cases where high resilience, robustness, and stability are key—such as in critical applications or those running on less predictable hardware.
In-process can be a better choice when you need lower latency, less complexity, or deeper insights into the immediate runtime state at the time of failure, and when minor crashes are more common than severe ones.
Many systems use a hybrid approach, employing in-process crash handlers to gather quick diagnostics and then offloading more comprehensive data collection to an out-of-process handler for additional stability and detail.

@bruno-garcia
Member

In the list of files you linked, I didn't see what actually creates the dumps. They just look for the dumps in the filesystem, which tells me the engine is creating the dumps (in-proc).

For example, Mac and iOS use PLCrashReporter:

On Windows I see MiniDumpWriteDump, at least from Unsync (is that the engine crash handling mechanism)?

Are these the actual mechanisms used by the crash reporter?

> After reaching out to get more information on how native handles backends, it came to my attention that we will run into issues supporting this crash handler.

> The biggest reason is that we gain access to GPU-related crash data such as hangs and device-loss errors

What does this actually look like in code (the files reading this data and adding to the crash dump)?
With that information we can see how hard it is to add that to sentry-native directly.

> for more detailed engine information that we do not normally track such as the execution command line and the crash report message from the engine potentially pointing to the problem.

Shouldn't we be able to get this already from sentry-native and add as context?

> By doing this we would be able to add Unreal Crash Core as a backend into native with minimal changes to the current Unreal plugin.

Would we though? I'm not convinced yet, because it's not clear to me what code we'd need to use from the Unreal Engine and how much effort it would take to pull it in. It sounds like we'd just be moving over to other standard crash reporting libraries and functions, so not much gain there.

We can already call MiniDumpWriteDump from sentry-native on Windows if that's a better route than crashpad/breakpad, we don't need to look into Unreal's source code for that. We definitely don't gain with PLCrashReporter vs using our own sentry-cocoa SDK, for example. There's a LOT we gain from keeping that under the hood on mobile.

> An option that was suggested whilst gathering info on native is potentially dropping native completely and re-implementing its behavior via Unreal.

We're definitely not doing that.

@tustanivsky
Collaborator

tustanivsky commented Nov 5, 2024

> for more detailed engine information that we do not normally track such as the execution command line and the crash report message from the engine potentially pointing to the problem.

We already have a simple integration with the UE Crash Handler that allows us to enrich the captured crashes with some of the above-mentioned properties. Basically, we set the on_crash handler during the Native SDK initialization (this works for the crashpad backend only) and attempt to grab additional crash info, which is then attached to the corresponding event.
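
The enrichment step described above can be pictured with the following standalone sketch. It does not use the real sentry-native or Unreal types: `EventExtras`, `EngineCrashInfo`, `EnrichEvent`, and the `unreal.*` key names are all hypothetical stand-ins for whatever the crash hook receives and attaches.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for the extra key/value data attached to an event.
typedef std::map<std::string, std::string> EventExtras;

// Hypothetical subset of the engine crash context, using field names from
// the "Information Tracked" table.
struct EngineCrashInfo {
    std::string CommandLine;
    std::string ErrorMessage;
    std::string CrashType;
};

// Copies whichever engine properties are available into the event extras.
// Empty properties are skipped so the event carries no blank entries.
EventExtras EnrichEvent(EventExtras Extras, const EngineCrashInfo& Info) {
    if (!Info.CommandLine.empty()) Extras["unreal.command_line"] = Info.CommandLine;
    if (!Info.ErrorMessage.empty()) Extras["unreal.error_message"] = Info.ErrorMessage;
    if (!Info.CrashType.empty()) Extras["unreal.crash_type"] = Info.CrashType;
    return Extras;
}
```

In the real integration this logic would run inside the crash hook, so it should avoid anything that is unsafe to do in a crashing process; the sketch ignores that constraint for brevity.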

One more place to look for ways to hook into Unreal's crash-handling flow is FOutputDeviceError and the platform-specific implementations of its Serialize and HandleError methods. Our custom implementation of this class is used for intercepting asserts; however, I assume it may be useful for reporting errors as well. The idea is to try constructing an equivalent Sentry event manually, with all the info Unreal managed to collect by the time the HandleError override is called, and send it via the Native SDK API (i.e. sentry_handle_exception or sentry_capture_event). If that works as expected, using sentry-native without a backend may become another option to consider.
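
A rough shape for that interception idea is sketched below. `ErrorDeviceBase` is a simplified stand-in, not Unreal's FOutputDeviceError (whose Serialize takes more parameters), and `ManualEvent`, `ReportingErrorDevice`, and the `Send` callback are hypothetical; in the plugin the callback would forward to the Native SDK.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-in for an engine error output device. The real Unreal
// class has different method signatures.
class ErrorDeviceBase {
public:
    virtual ~ErrorDeviceBase() {}
    virtual void Serialize(const std::string& Message) = 0;
    virtual void HandleError() = 0;
};

// Hypothetical manually constructed event.
struct ManualEvent {
    std::string Level;
    std::string Message;
};

// Buffers everything the engine writes, then builds and sends one event
// when HandleError is called.
class ReportingErrorDevice : public ErrorDeviceBase {
public:
    explicit ReportingErrorDevice(std::function<void(const ManualEvent&)> InSend)
        : Send(std::move(InSend)) {}

    void Serialize(const std::string& Message) override { Lines.push_back(Message); }

    void HandleError() override {
        if (Lines.empty() || !Send) return;  // nothing collected or nowhere to send

        ManualEvent Event;
        Event.Level = "fatal";
        for (std::size_t Index = 0; Index < Lines.size(); ++Index) {
            if (Index != 0) Event.Message += '\n';
            Event.Message += Lines[Index];
        }
        Send(Event);
        Lines.clear();
    }

private:
    std::function<void(const ManualEvent&)> Send;
    std::vector<std::string> Lines;
};
```

Keeping the sender as an injected callback means the same device could be pointed at either capture path mentioned above without changing the buffering logic.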
