createdump fails on RHEL 8/arm64 with "stack smashing detected" #108023

Closed
omajid opened this issue Sep 19, 2024 · 12 comments

@omajid
Member

omajid commented Sep 19, 2024

Description

I am trying to run createdump against an ASP.NET Core application running on RHEL 8 on arm64/aarch64. This works flawlessly with .NET 8, but fails with .NET 9.

Reproduction Steps

$ ~/dotnet/dotnet new web
$ ~/dotnet/dotnet run
info: Microsoft.Hosting.Lifetime[14]
      Now listening on: http://localhost:5082                                  
info: Microsoft.Hosting.Lifetime[0]
      Application started. Press Ctrl+C to shut down.                          
info: Microsoft.Hosting.Lifetime[0]
      Hosting environment: Development
info: Microsoft.Hosting.Lifetime[0]
      Content root path: /home/omajid/hello

# in a separate terminal
$ ~/dotnet/shared/Microsoft.NETCore.App/9.0.0-rc.1.24431.7/createdump -f dump 33966
[createdump] Gathering state for process 33966 dotnet
[createdump] Writing minidump with heap to file dump
[createdump] Written 267321344 bytes (4079 pages) to core file
[createdump] Target process is alive
*** stack smashing detected ***: <unknown> terminated
Aborted (core dumped)

Expected behavior

I get a dump

Actual behavior

Fails

Regression?

This was working in .NET 8. It was broken on both .NET 9 Preview 7 and .NET 9 RC 1.

Known Workarounds

No response

Configuration

This is the .NET 9 RC 1 SDK published by Microsoft:

$ ~/dotnet/dotnet --info
.NET SDK:
 Version:           9.0.100-rc.1.24452.12
 Commit:            81a714c6d3
 Workload version:  9.0.100-manifests.a7bf2b8f
 MSBuild version:   17.12.0-preview-24422-09+d17ec720d

Runtime Environment:
 OS Name:     rhel
 OS Version:  8
 OS Platform: Linux
 RID:         linux-arm64
 Base Path:   /home/omajid/dotnet/sdk/9.0.100-rc.1.24452.12/

.NET workloads installed:
Configured to use loose manifests when installing new manifests.
There are no installed workloads to display.

Host:
  Version:      9.0.0-rc.1.24431.7
  Architecture: arm64
  Commit:       static

.NET SDKs installed:
  9.0.100-rc.1.24452.12 [/home/omajid/dotnet/sdk]

.NET runtimes installed:
  Microsoft.AspNetCore.App 9.0.0-rc.1.24452.1 [/home/omajid/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 9.0.0-rc.1.24431.7 [/home/omajid/dotnet/shared/Microsoft.NETCore.App]

Other architectures found:
  None

Environment variables:
  Not set

global.json file:
  Not found

Learn more:
  https://aka.ms/dotnet/info

Download .NET:
  https://aka.ms/dotnet/download

This also reproduces with a self-built .NET 9 using the VMR/source-build.

$ cat /etc/os-release 
NAME="Red Hat Enterprise Linux"
VERSION="8.10 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.10"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.10 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::baseos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8"
BUG_REPORT_URL="https://issues.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.10
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.10"

This only happens on arm64/aarch64. It doesn't happen on x64.

$ uname -m
aarch64

Other information

No response

@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Sep 19, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Sep 19, 2024
@omajid
Member Author

omajid commented Sep 19, 2024

cc @tmds

@jkotas jkotas added area-Diagnostics-coreclr and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Sep 19, 2024
Contributor

Tagging subscribers to this area: @tommcdon
See info in area-owners.md if you want to be subscribed.

@tommcdon tommcdon added this to the 10.0.0 milestone Sep 19, 2024
@tommcdon tommcdon removed the untriaged New issue has not been triaged by the area owner label Sep 19, 2024
@tmds
Member

tmds commented Sep 19, 2024

RHEL 8 on arm64/aarch64

We've had some issues in the past due to the 64kB page size (like #91864), this may be another one of those.
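
For context, a minimal sketch of how a native tool can query the page size at runtime instead of assuming 4 KiB. The helper names `RuntimePageSize` and `IsPowerOfTwo` are made up for illustration and are not taken from the runtime sources; `sysconf(_SC_PAGESIZE)` is the standard POSIX call.

```cpp
#include <cassert>
#include <cstddef>
#include <unistd.h>

// Hypothetical helper: returns the page size reported by the kernel,
// or 0 if the query fails. On RHEL 8 arm64 this is typically 65536.
inline std::size_t RuntimePageSize() {
    const long value = sysconf(_SC_PAGESIZE);
    return value > 0 ? static_cast<std::size_t>(value) : 0;
}

// Hypothetical helper: page sizes are powers of two, which makes this
// a cheap sanity check on whatever value comes back.
inline bool IsPowerOfTwo(std::size_t n) {
    return n != 0 && (n & (n - 1)) == 0;
}
```

Any buffer whose size was chosen with 4 KiB pages in mind should be compared against this value rather than against a compile-time constant.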

@mikem8361
Member

@omajid, is there any way you could catch the stack smashing under lldb? It is going to take me a while to put together an RHEL 8 arm64 device.

@omajid
Member Author

omajid commented Sep 23, 2024

$ lldb /usr/lib64/dotnet/shared/Microsoft.NETCore.App/9.0.0-rc.1.24431.7/createdump
(lldb) target create "/usr/lib64/dotnet/shared/Microsoft.NETCore.App/9.0.0-rc.1.24431.7/createdump"                                                          
Current executable set to '/usr/lib64/dotnet/shared/Microsoft.NETCore.App/9.0.0-rc.1.24431.7/createdump' (aarch64).
(lldb) r -f dump 34680
Process 35063 launched: '/usr/lib64/dotnet/shared/Microsoft.NETCore.App/9.0.0-rc.1.24431.7/createdump' (aarch64)
[createdump] Gathering state for process 34680 dotnet
Process 35063 stopped and restarted: thread 1 received signal: SIGCHLD
Process 35063 stopped and restarted: thread 1 received signal: SIGCHLD
Process 35063 stopped and restarted: thread 1 received signal: SIGCHLD
Process 35063 stopped and restarted: thread 1 received signal: SIGCHLD
Process 35063 stopped and restarted: thread 1 received signal: SIGCHLD
Process 35063 stopped and restarted: thread 1 received signal: SIGCHLD
Process 35063 stopped and restarted: thread 1 received signal: SIGCHLD
Process 35063 stopped and restarted: thread 1 received signal: SIGCHLD
Process 35063 stopped and restarted: thread 1 received signal: SIGCHLD
Process 35063 stopped and restarted: thread 1 received signal: SIGCHLD
Process 35063 stopped and restarted: thread 1 received signal: SIGCHLD
Process 35063 stopped and restarted: thread 1 received signal: SIGCHLD
Process 35063 stopped and restarted: thread 1 received signal: SIGCHLD
Process 35063 stopped and restarted: thread 1 received signal: SIGCHLD
Process 35063 stopped and restarted: thread 1 received signal: SIGCHLD
Process 35063 stopped and restarted: thread 1 received signal: SIGCHLD
Process 35063 stopped and restarted: thread 1 received signal: SIGCHLD
Process 35063 stopped and restarted: thread 1 received signal: SIGCHLD
[createdump] Writing minidump with heap to file dump
Process 35063 stopped
* thread #1, name = 'createdump', stop reason = signal SIGSEGV: address not mapped to object (fault address: 0x1000000000000)                                
    frame #0: 0x0000fffff7b22d40 libc.so.6`__GI___memset_generic + 256
libc.so.6`__GI___memset_generic:
->  0xfffff7b22d40 <+256>: dc     zva, x3
    0xfffff7b22d44 <+260>: add    x3, x3, #0x40
    0xfffff7b22d48 <+264>: subs   x2, x2, #0x40
    0xfffff7b22d4c <+268>: b.hi   0xfffff7b22d40            ; <+256>
(lldb) bt
* thread #1, name = 'createdump', stop reason = signal SIGSEGV: address not mapped to object (fault address: 0x1000000000000)                                
  * frame #0: 0x0000fffff7b22d40 libc.so.6`__GI___memset_generic + 256
    frame #1: 0x0000aaaaaaac5ad0 createdump`DumpWriter::WriteDiagInfo(unsigned long) [inlined] memset(__dest=0x0000ffffffffa298, __ch=0, __len=65496) at string_fortified.h:74:10 [opt]
    frame #2: 0x0000aaaaaaac5ac0 createdump`DumpWriter::WriteDiagInfo(this=0x0000ffffffffa280, size=<unavailable>) at dumpwriter.cpp:50:5 [opt]              
    frame #3: 0x0000aaaaaaabfbf4 createdump`DumpWriter::WriteDump(this=0x0000ffffffffa280) at dumpwriterelf.cpp:181:18 [opt]                                 
    frame #4: 0x0000aaaaaaabc5f0 createdump`CreateDump(options=0x0000ffffffffe328) at createdumpunix.cpp:89:25 [opt]
(lldb) frame select 2
frame #2: 0x0000aaaaaaac5ac0 createdump`DumpWriter::WriteDiagInfo(this=0x0000ffffffffa280, size=<unavailable>) at dumpwriter.cpp:50:5 [opt]
   47       }
   48       size_t alignment = size - sizeof(header);
   49       assert(alignment < sizeof(m_tempBuffer));
-> 50       memset(m_tempBuffer, 0, alignment);
   51       if (!WriteData(m_tempBuffer, alignment)) {
   52           return false;
   53       }
(lldb) p alignment
(size_t) 65496
(lldb) p size
error: Couldn't materialize: couldn't get the value of variable size: Could not evaluate DW_OP_entry_value.                                                  
error: errored out in DoExecute, couldn't PrepareToExecuteJITExpression
(lldb) p sizeof(header)
(unsigned long) 40
(lldb) p sizeof(m_tempBuffer)
(unsigned long) 16384

Looks like we are trying to write 65496 bytes to a location that can only hold 16384 bytes.

$ getconf PAGE_SIZE
65536
$ python3 -c 'print(65536 - 40)'
65496

Yeah, looks like a page size issue like @tmds mentioned above.
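
The arithmetic above can be written down as a small compile-time check. This is only a sketch using the numbers printed by lldb (`sizeof(header)` = 40, `sizeof(m_tempBuffer)` = 16384); the constant and function names are hypothetical and not the ones used in createdump.

```cpp
#include <cassert>
#include <cstddef>

// Values observed in the lldb session, hard-coded here for illustration.
constexpr std::size_t kHeaderSize = 40;         // sizeof(header)
constexpr std::size_t kTempBufferSize = 16384;  // sizeof(m_tempBuffer)

// Zero padding needed after the header to fill a block of blockSize bytes.
constexpr std::size_t PaddingAfterHeader(std::size_t blockSize) {
    return blockSize - kHeaderSize;
}

// Mirrors the condition in the assert at dumpwriter.cpp:49.
constexpr bool PaddingFitsInTempBuffer(std::size_t blockSize) {
    return blockSize >= kHeaderSize &&
           PaddingAfterHeader(blockSize) < kTempBufferSize;
}

static_assert(PaddingAfterHeader(65536) == 65496, "matches the lldb value");
static_assert(!PaddingFitsInTempBuffer(65536), "64 KiB block overflows");
static_assert(PaddingFitsInTempBuffer(4096), "4 KiB block fits");
```

With a 4 KiB or 16 KiB block the padding fits in the buffer, so the overflow only shows up when the block size follows a 64 KiB page size.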

@omajid
Member Author

omajid commented Sep 23, 2024

Digging a bit, it looks like @tmds's changes in #91865 were reverted by #95433. So the original issue in #91864 has reappeared.

@mikem8361
Member

Thanks for figuring this out. Looks like I did the original fix trying to fix an assert on macOS arm64. I'm looking into how to fix both.

mikem8361 added a commit to mikem8361/runtime that referenced this issue Sep 23, 2024
Change the special diag info block size back to SpecialDiagInfoSize so m_tempBuffer isn't
overwritten by 64K PAGE_SIZE and use the 4 parameter MemoryRegion constructor that doesn't
assert the address/size is on a PAGE_SIZE boundary.

The changes from PR dotnet#91865 were reverted by PR dotnet#95433.

Issue: dotnet#108023
mikem8361 added a commit to mikem8361/runtime that referenced this issue Sep 23, 2024
The changes from PR dotnet#91865 were reverted by PR dotnet#95433.

This change restores the fix from PR dotnet#91865 by changing the size back to SpecialDiagInfoSize
but uses the 4 parameter MemoryRegion constructor that doesn't assert the address/size is on
a PAGE_SIZE alignment (PR dotnet#95433).

Issue: dotnet#108023
mikem8361 added a commit that referenced this issue Sep 24, 2024
The changes from PR #91865 were reverted by PR #95433.

This change restores the fix from PR #91865 by changing the size back to SpecialDiagInfoSize
but uses the 4 parameter MemoryRegion constructor that doesn't assert the address/size is on
a PAGE_SIZE alignment (PR #95433).

Issue: #108023
github-actions bot pushed a commit that referenced this issue Sep 24, 2024
The changes from PR #91865 were reverted by PR #95433.

This change restores the fix from PR #91865 by changing the size back to SpecialDiagInfoSize
but uses the 4 parameter MemoryRegion constructor that doesn't assert the address/size is on
a PAGE_SIZE alignment (PR #95433).

Issue: #108023
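
A rough sketch of the idea described in the commit message: size the block with a fixed constant instead of PAGE_SIZE, so the padding is known at compile time and can be checked against the buffer. This is not the code from the PR. `kDiagBlockSize` is a stand-in for SpecialDiagInfoSize with an assumed value of 4096, and `FillDiagPadding` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Assumed values for illustration only; the real constants live in the
// runtime sources and may differ.
constexpr std::size_t kDiagBlockSize = 4096;    // stand-in for SpecialDiagInfoSize
constexpr std::size_t kHeaderSize = 40;         // sizeof(header) from the lldb session
constexpr std::size_t kTempBufferSize = 16384;  // sizeof(m_tempBuffer) from the lldb session

// The padding no longer depends on the runtime page size, so the bound
// can be enforced by the compiler, even in optimized builds.
static_assert(kDiagBlockSize > kHeaderSize, "block must hold the header");
static_assert(kDiagBlockSize - kHeaderSize < kTempBufferSize,
              "padding must fit in the temporary buffer");

// Hypothetical helper: zeroes the padding that follows the header and
// returns how many bytes were prepared for writing.
inline std::size_t FillDiagPadding(char (&tempBuffer)[kTempBufferSize]) {
    constexpr std::size_t padding = kDiagBlockSize - kHeaderSize;
    std::memset(tempBuffer, 0, padding);
    return padding;
}
```

A `static_assert` is used here because a plain `assert` is normally compiled out when NDEBUG is defined, which is usually the case for optimized builds like the `[opt]` frames in the backtrace above.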
@mikem8361
Member

@omajid, is there any way you could validate that this fix (PR #108166) resolves your issue?

@omajid
Member Author

omajid commented Sep 24, 2024

Yes, I should be able to take a VMR checkout, apply this change and see if the resulting SDK has any issues or not. Looking at it now.

@tommcdon tommcdon modified the milestones: 10.0.0, 9.0.0 Sep 24, 2024
@tommcdon
Member

Moving issue to 9.0.0 for backport

@omajid
Member Author

omajid commented Sep 24, 2024

I can confirm this PR makes things work for me again:

$ uname -m
aarch64
$ cat /etc/os-release 
NAME="Red Hat Enterprise Linux"
VERSION="8.10 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.10"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.10 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::baseos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8"
BUG_REPORT_URL="https://issues.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.10
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.10"
$ ~/dotnet-sdk/dotnet --info
.NET SDK:
 Version:           9.0.100-rc.2.24474.1
 Commit:            1f747cd885
 Workload version:  9.0.100-manifests.934ebbcd
 MSBuild version:   17.12.0-preview-24469-05+1f747cd88

Runtime Environment:
 OS Name:     rhel
 OS Version:  8
 OS Platform: Linux
 RID:         rhel.8.10-arm64
 Base Path:   /home/omajid/dotnet-sdk/sdk/9.0.100-rc.2.24474.1/

.NET workloads installed:
There are no installed workloads to display.
Configured to use loose manifests when installing new manifests.

Host:
  Version:      9.0.0-rtm.24473.2
  Architecture: arm64
  Commit:       static

.NET SDKs installed:
  9.0.100-rc.2.24474.1 [/home/omajid/dotnet-sdk/sdk]

.NET runtimes installed:
  Microsoft.AspNetCore.App 9.0.0-rtm.24473.16 [/home/omajid/dotnet-sdk/shared/Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 9.0.0-rtm.24473.2 [/home/omajid/dotnet-sdk/shared/Microsoft.NETCore.App]

Other architectures found:
  None

Environment variables:
  Not set

global.json file:
  Not found

Learn more:
  https://aka.ms/dotnet/info

Download .NET:
  https://aka.ms/dotnet/download
$ ~/dotnet-sdk/shared/Microsoft.NETCore.App/9.0.0-rtm.24473.2/createdump 357990
[createdump] Gathering state for process 357990 dotnet
[createdump] Writing minidump with heap to file /tmp/coredump.357990
[createdump] Written 339873792 bytes (5186 pages) to core file
[createdump] Target process is alive
[createdump] Dump successfully written in 360ms

I tried again without the fix in #108166 and it continues to crash, confirming that #108166 is the fix.

jeffschwMSFT added a commit that referenced this issue Sep 25, 2024
The changes from PR #91865 were reverted by PR #95433.

This change restores the fix from PR #91865 by changing the size back to SpecialDiagInfoSize
but uses the 4 parameter MemoryRegion constructor that doesn't assert the address/size is on
a PAGE_SIZE alignment (PR #95433).

Issue: #108023

Co-authored-by: Mike McLaughlin <[email protected]>
Co-authored-by: Jeff Schwartz <[email protected]>
@omajid
Member Author

omajid commented Sep 26, 2024

Now that #108166 and #108208 have been merged, I am going to close this issue.

Thanks!

@omajid omajid closed this as completed Sep 26, 2024
sirntar pushed a commit to sirntar/runtime that referenced this issue Sep 30, 2024
The changes from PR dotnet#91865 were reverted by PR dotnet#95433.

This change restores the fix from PR dotnet#91865 by changing the size back to SpecialDiagInfoSize
but uses the 4 parameter MemoryRegion constructor that doesn't assert the address/size is on
a PAGE_SIZE alignment (PR dotnet#95433).

Issue: dotnet#108023