Segmentation fault linux since net 6 upgrade #69323
Comments
Tagging subscribers to this area: @JulieLeeMSFT
Issue Details
Description
Started getting a segmentation fault in linux after upgrading to net 6 (& PGO enabled too). It's not consistent though, happening 1/7. Seen it happen twice already.
Call stack
Reproduction Steps
Sorry it's a complex project in a private repository. I can say there is a lot of action going on, high CPU/ram usage, file read/write.
Expected behavior
No segmentation fault
Actual behavior
Segmentation fault
Regression?
Net 5 worked
Known Workarounds
No response
Configuration
Other information
No response
FYI @BruceForstall Faulting code is runtime/src/coreclr/jit/optimizer.cpp Lines 1928 to 1933 in ef4ed6d
Presumably @Martin-Molinero you may be able to work around this by annotating the method that is being jitted here with either … If you can (privately) share a core dump, we can try to track down how we're hitting this issue. Also, if you're not sure which method was being jitted, we can probably figure this out from the core dump too.
If you are able to debug under lldb, and can use SOS (https://docs.microsoft.com/en-us/dotnet/core/diagnostics/debug-linux-dumps, https://docs.microsoft.com/en-us/dotnet/core/diagnostics/dotnet-sos), and can find a MethodDesc pointer on the stack, using the … There are likely some very complex or unusual loop structures. If you can't share the code or a core dump, perhaps you could try to extract a sample of the same function that still exhibits the problem.
Yes, but even after installing SOS the commands are not available in lldb, as I understand they should be. I've shared the core dump with @AndyAyersMS.
Thanks, I'll take a look.
Based on the data you sent me offline, it seems like the JIT is compiling … A workaround for now is to stop setting … I can reproduce what look like similar issues in .NET 6.0 by feeding the JIT randomized profile data, but I have yet to confirm whether this is indeed the same problem or a related one.
I think I understand roughly what happens. We have a flow graph with more than one loop. The first loop we find has some non-loop code in its extent, including two different stretches of code belonging to a second loop (because PGO has moved blocks). We manage to move the first extent with … The fix for .NET 6 is likely to just avoid the AV by giving up on recognizing the second loop. Hopefully we can find a more robust fix in .NET 7 that avoids this sort of scrambled order.
…t chain
In dotnet#69323 the 6.0.4 JIT caused an AV because it walked off the end of the bbNext chain during `optFindNaturalLoops`. Analysis of a customer-provided dump showed that `MakeCompactAndFindExits` might fail to find an expected loop block, and so walk the entire bbNext chain and then fall off the end. Details from the dump suggested that this happened because a prior call to `MakeCompactAndFindExits` had moved most but not all of a loop's blocks later in bbNext order, leaving that loop's bottom block earlier in the bbNext chain than its top. This ordering was unexpected.
I cannot repro this failure. The customer was using PGO, and it's likely that earlier PGO-driven block reordering contributed to this problem by interleaving the blocks from two loops. We can recover the root method PGO schema from the dump, but applying this is insufficient to cause the problem. This method does quite a bit of inlining, so it's likely that some inlinee PGO data must also be a contributing factor.
At any rate, we can guard against this case easily enough, and simply abandon recognition of any loop where we fail to find an expected loop block during the bbNext chain walk.
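The guard described in that commit message can be pictured with a small standalone sketch. This is not the JIT's actual code: `Block`, `reachesByNext`, and `tryRecognizeLoop` are hypothetical names invented here for illustration, and the real `MakeCompactAndFindExits` does considerably more work. The sketch only assumes what the message states, namely that blocks form a singly linked chain and that a loop's bottom block is expected to be found by walking forward from its top.

```cpp
// Hypothetical stand-in for a JIT basic block; only the fields this sketch needs.
struct Block
{
    int    num;   // illustrative block number
    Block* next;  // plays the role of the bbNext link
};

// Walk forward from `start` looking for `target`.
// The loop condition checks for the end of the chain, so a missing target
// yields `false` instead of a null dereference (the AV in the report).
bool reachesByNext(const Block* start, const Block* target)
{
    for (const Block* b = start; b != nullptr; b = b->next)
    {
        if (b == target)
        {
            return true;
        }
    }
    return false;
}

// Hypothetical recognition step: if the expected bottom block is not found
// after the top block in chain order, give up on this loop rather than
// continuing with an ordering the later code does not expect.
bool tryRecognizeLoop(const Block* top, const Block* bottom)
{
    if (!reachesByNext(top, bottom))
    {
        return false; // abandon recognition of this loop
    }
    return true; // the real code would continue compacting the loop here
}
```

With a three-block chain 1 -> 2 -> 3, calling `tryRecognizeLoop` with top 1 and bottom 3 succeeds, while calling it with top 2 and bottom 1 (the scrambled case, where the bottom sits earlier in the chain than the top) returns `false` and the loop is simply not recognized.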
…t chain (#69503)
This is fixed in main / 7.0; will close this once we've serviced it in 6.x.
…t chain (#69525) Co-authored-by: Andy Ayers
@Martin-Molinero 6.0.7 is now out and hopefully fixes this and other issues you ran into with PGO: https://devblogs.microsoft.com/dotnet/july-2022-updates/ Let me know if you get a chance to try it out.
I'm going to close this; feel free to re-open if needed once you get around to trying 6.0.7.
Description
Started getting a segmentation fault on Linux after upgrading to .NET 6 (with PGO enabled too). It's not consistent though, happening roughly 1 time in 7. Seen it happen twice already.
Call stack
bt_all.txt
Reproduction Steps
Sorry, it's a complex project in a private repository. I can say there is a lot of action going on: high CPU/RAM usage, file reads/writes.
Expected behavior
No segmentation fault
Actual behavior
Segmentation fault
Regression?
.NET 5 worked
Known Workarounds
No response
Configuration
Other information