Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Self-hosted agent on M2 hangs when building MAUI on NET6 / NET7 / NET8 #19547

Closed
svaldetero opened this issue Nov 28, 2023 · 15 comments
Closed
Labels
bug If an issue is a bug or a pull request a bug fix
Milestone

Comments

@svaldetero
Copy link

Steps to Reproduce

  1. Create an agent on M2 ARM 64 agent (3.230.0)
  2. Build a .NET6/.NET7/.NET8 iOS project
  3. Notice the build hangs and never finishes

Expected Behavior

Build should finish

Actual Behavior

Build hangs with no output until it times out

Environment

Name Value
clang /usr/bin/clang
curl /usr/bin/curl
docker /usr/local/bin/docker
dotnet /usr/local/share/dotnet/dotnet
git /usr/bin/git
HOME /Users/envocadmin
InteractiveSession False
java /usr/bin/java
JAVA_HOME $(/usr/libexec/java_home)
JAVA_HOME_11_X64 /Library/Java/JavaVirtualMachines/microsoft-11.jdk/Contents/Home
JDK /usr/bin/javac
LANG en_US.UTF-8
LOGNAME envocadmin
make /usr/bin/make
MSBuild /Library/Frameworks/Mono.framework/Versions/Current/Commands/msbuild
PATH /usr/local/bin:/System/Cryptexes/App/usr/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/share/dotnet:/Users/envocadmin/.dotnet/tools:/Library/Apple/usr/bin:/Library/Frameworks/Mono.framework/Versions/Current/Commands:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/local/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/appleinternal/bin:/opt/homebrew/bin:/opt/homebrew/bin
PWD /Users/Shared/Agents/Agent2
python3 /usr/bin/python3
rake /usr/bin/rake
ruby /usr/bin/ruby
SDKMANAGER $ANDROID_HOME/cmdline-tools/7.0/bin/sdkmanager
sh /bin/sh
SHELL /bin/zsh
SSH_AUTH_SOCK /private/tmp/com.apple.launchd.fZyAVo93Zs/Listeners
TMPDIR /var/folders/xk/7mssnqbd0v5f_z1fwblvmw1m0000gn/T/
USER envocadmin
VSTS_AGENT_SVC 1
Xamarin.Android /Library/Frameworks/Xamarin.Android.framework/Versions/Current/Version
Xamarin.iOS /Applications/Visual Studio.app/Contents/MacOS/vstool
xcode /Applications/Xcode.app/Contents/Developer
XPC_FLAGS 0x0
XPC_SERVICE_NAME 0
 

Let me know if I can email you diagnostic logs or the link to the build definition. I'd prefer not to post the whole logs here.

Versions

Agent version: 3.230.0
macOS: 13.6
.NET: 8.0.100
Microsoft.Maui.Sdk: 8.0.3
Microsoft.iOS.Sdk: 17.0.8478

Build Logs

Example Project (If Possible)

Previously following #17825 but can't update it since I'm not a collaborator. Also related microsoft/azure-pipelines-agent#4205
This has been an issue on NET 6 & NET 7. I just upgraded to NET 8 since it released and the above bug has been fixed, however I still have the issue.

Kicking off a dotnet publish for a NET 8 Maui ios app, hangs forever. Opening a terminal on the mac, pasting the exact line, and executing it in the agent's folder, runs fine.

Filed corresponding bug in Pipelines Agent: microsoft/azure-pipelines-agent#4546

@rolfbjarne I don't think your fix for limiting the parallelization of the CPU fixed the issue for me.

Debug output gave near identical output like this for the last 45+ minutes:

ADO Output

##[debug]Agent running environment resource - Disk: available:627050.00MB out of 948584.00MB, Memory: used 27MB out of 65536MB, CPU: usage 0.02
##[debug]Agent running environment resource - Disk: available:627050.00MB out of 948584.00MB, Memory: used 27MB out of 65536MB, CPU: usage 0.02
##[debug]Agent running environment resource - Disk: available:627050.00MB out of 948584.00MB, Memory: used 27MB out of 65536MB, CPU: usage 0.02
##[debug]Agent running environment resource - Disk: available:627050.00MB out of 948584.00MB, Memory: used 27MB out of 65536MB, CPU: usage 0.02
##[debug]Agent running environment resource - Disk: available:627050.00MB out of 948584.00MB, Memory: used 27MB out of 65536MB, CPU: usage 0.02
##[debug]Agent running environment resource - Disk: available:627050.00MB out of 948584.00MB, Memory: used 27MB out of 65536MB, CPU: usage 0.02
##[debug]Agent running environment resource - Disk: available:627050.00MB out of 948584.00MB, Memory: used 27MB out of 65536MB, CPU: usage 0.02
##[debug]Agent running environment resource - Disk: available:627050.00MB out of 948584.00MB, Memory: used 27MB out of 65536MB, CPU: usage 0.02
##[debug]Agent running environment resource - Disk: available:627050.00MB out of 948584.00MB, Memory: used 27MB out of 65536MB, CPU: usage 0.02
##[debug]Agent running environment resource - Disk: available:627050.00MB out of 948584.00MB, Memory: used 27MB out of 65536MB, CPU: usage 0.02
##[debug]Agent running environment resource - Disk: available:627050.00MB out of 948584.00MB, Memory: used 27MB out of 65536MB, CPU: usage 0.02
##[debug]Agent running environment resource - Disk: available:627050.00MB out of 948584.00MB, Memory: used 27MB out of 65536MB, CPU: usage 0.02
##[debug]Agent running environment resource - Disk: available:627050.00MB out of 948584.00MB, Memory: used 27MB out of 65536MB, CPU: usage 0.02
##[debug]Agent running environment resource - Disk: available:627050.00MB out of 948584.00MB, Memory: used 27MB out of 65536MB, CPU: usage 0.02
##[debug]Agent running environment resource - Disk: available:627050.00MB out of 948584.00MB, Memory: used 27MB out of 65536MB, CPU: usage 0.02
##[debug]Agent running environment resource - Disk: available:627050.00MB out of 948584.00MB, Memory: used 27MB out of 65536MB, CPU: usage 0.02
##[debug]Agent running environment resource - Disk: available:627050.00MB out of 948584.00MB, Memory: used 27MB out of 65536MB, CPU: usage 0.02
##[debug]Agent running environment resource - Disk: available:627050.00MB out of 948584.00MB, Memory: used 27MB out of 65536MB, CPU: usage 0.02
##[debug]Agent running environment resource - Disk: available:627050.00MB out of 948584.00MB, Memory: used 27MB out of 65536MB, CPU: usage 0.02
##[debug]Agent running environment resource - Disk: available:627050.00MB out of 948584.00MB, Memory: used 27MB out of 65536MB, CPU: usage 0.02
##[debug]Starting diagnostic file upload.
##[debug]Setting up diagnostic log folders.
##[debug]Creating diagnostic log files folder.
##[debug]Creating diagnostic log environment file.

@rolfbjarne
Copy link
Member

It's a bummer the attempted fix didn't work... I'll try to figure out what's going on, but it may take a while.

Can you start by trying to run dotnet-stack as Filip suggests here: #17825 (comment)?

@rolfbjarne rolfbjarne added this to the Future milestone Nov 30, 2023
@rolfbjarne rolfbjarne added the bug If an issue is a bug or a pull request a bug fix label Nov 30, 2023
@svaldetero
Copy link
Author

@rolfbjarne I tried the command pgrep dotnet | xargs -L1 dotnet stack report -p after I suspect the build freezes. That command doesn't respond / return until the build times out and quits. When that happens, it prints out
[Error] System.IO.EndOfStreamException: Unable to read beyond the end of the stream
The command prints normally before or after I do that task in the build.

@rolfbjarne
Copy link
Member

rolfbjarne commented Dec 4, 2023

@rolfbjarne I tried the command pgrep dotnet | xargs -L1 dotnet stack report -p after I suspect the build freezes. That command doesn't respond / return until the build times out and quits. When that happens, it prints out [Error] System.IO.EndOfStreamException: Unable to read beyond the end of the stream The command prints normally before or after I do that task in the build.

Can you try using lldb instead then:

# print out what each process does
ps ux -p $(pgrep dotnet)

# print a backtrace for each process
pgrep dotnet | xargs -L1 lldb -o 'bt all' --batch -p

@svaldetero
Copy link
Author

When I run those commands on the build agent, it kills the build. Here's the outputs:

envocadmin@HM-MACSTUDIO ~ % ps ux -p $(pgrep dotnet)                            
USER         PID  %CPU %MEM      VSZ    RSS   TT  STAT STARTED      TIME COMMAND
envocadmin  3595   0.0  0.8 38722816 546536   ??  S     9:47AM   0:24.05 dotnet publish Mobile/src/Maui/EnvocTemplate.AppDevTemplate.Maui.csproj -f net8.0-ios --self-contained -r ios-arm64 -c Release -p:BuildIpa=True -p:ApplicationId=com.envoc
envocadmin@HM-MACSTUDIO ~ % pgrep dotnet | xargs -L1 lldb -o 'bt all' --batch -p
(lldb) process attach --pid 3595
error: attach failed: tried to attach to process already being debugged
Azure Devops Output EnvocTemplate.AppDevTemplate.Maui -> /Users/Shared/Agents/Agent2/_work/10/s/Mobile/src/Maui/bin/Release/net8.0-ios/ios-arm64/EnvocTemplate.AppDevTemplate.Maui.dll /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/bitcode_strip Optimizing assemblies for size may change the behavior of the app. Be sure to test after publishing. See: https://aka.ms/dotnet-illink Optimizing assemblies for size. This process might take a while. /Users/envocadmin/.nuget/packages/xamarin.firebase.ios.core/8.10.0.3/lib/net6.0-ios15.4/Firebase.Core.resources/GoogleUtilitiesComponents.xcframework/ios-arm64/GoogleUtilitiesComponents.framework/GoogleUtilitiesComponents : warning MT7091: The framework /Users/envocadmin/.nuget/packages/xamarin.firebase.ios.core/8.10.0.3/lib/net6.0-ios15.4/Firebase.Core.resources/GoogleUtilitiesComponents.xcframework/ios-arm64/GoogleUtilitiesComponents.framework is a framework of static libraries, and will not be copied to the app. [/Users/Shared/Agents/Agent2/_work/10/s/Mobile/src/Maui/EnvocTemplate.AppDevTemplate.Maui.csproj::TargetFramework=net8.0-ios] /Users/envocadmin/Library/Caches/XamarinBuildDownload/FAnlytcs-8.9.1/FirebaseAnalytics-8.9.1/Frameworks/FirebaseAnalytics.xcframework/ios-arm64_armv7/FirebaseAnalytics.framework/FirebaseAnalytics : warning MT7091: The framework /Users/envocadmin/Library/Caches/XamarinBuildDownload/FAnlytcs-8.9.1/FirebaseAnalytics-8.9.1/Frameworks/FirebaseAnalytics.xcframework/ios-arm64_armv7/FirebaseAnalytics.framework is a framework of static libraries, and will not be copied to the app. [/Users/Shared/Agents/Agent2/_work/10/s/Mobile/src/Maui/EnvocTemplate.AppDevTemplate.Maui.csproj::TargetFramework=net8.0-ios] /Users/envocadmin/Library/Caches/XamarinBuildDownload/GAppM-8.9.1/GoogleAppMeasurement-8.9.1/Frameworks/GoogleAppMeasurement.xcframework/ios-arm64_armv7/GoogleAppMeasurement.framework/GoogleAppMeasurement : warning MT7091: The framework /Users/envocadmin/Library/Caches/XamarinBuildDownload/GAppM-8.9.1/GoogleAppMeasurement-8.9.1/Frameworks/GoogleAppMeasurement.xcframework/ios-arm64_armv7/GoogleAppMeasurement.framework is a framework of static libraries, and will not be copied to the app. [/Users/Shared/Agents/Agent2/_work/10/s/Mobile/src/Maui/EnvocTemplate.AppDevTemplate.Maui.csproj::TargetFramework=net8.0-ios] assertion failed [_is_handling_requests && _is_suspended]: thread is unexpectedly in the runtime. arm_pc=0x7ff7ffc7d4fc arm_lr=0x7ff7ffc8f1c4 saved_lr=0x7ff89effac2c (ThreadContextRegisterState.cpp:390 guest_gpr_state_from_host_state) /Users/Shared/Agents/Agent2/_work/_temp/a3cc7759-f218-468e-ba53-7185874aeaeb.sh: line 1: 3595 Killed: 9 dotnet publish 'Mobile/src/Maui/EnvocTemplate.AppDevTemplate.Maui.csproj' -f net8.0-ios --self-contained -r ios-arm64 -c Release -p:BuildIpa=True -p:ApplicationId=com.envoctemplate.appdevtemplate.qa /p:ApplicationVersion=61713 /p:ApplicationDisplayVersion=0.1.3 /p:ApplicationTitle='QA' ##[error]Bash exited with code '137'. Finishing: Publish iOS

@rolfbjarne
Copy link
Member

tried to attach to process already being debugged would probably also explain why dotnet stack wasn't able to get a stack report either.

Next try, once the hang happens again:

  • Open Activity Monitor, double-click the stuck dotnet process, and click "Sample" in the lower left corner, then attach the resulting report here.

@akoeplinger @filipnavara any other ideas how to diagnose this?

@svaldetero
Copy link
Author

Here's the file
Sample of dotnet.txt

@rolfbjarne
Copy link
Member

rolfbjarne commented Dec 6, 2023

Ok, that shows numerous interesting stack traces, it seems the runtime has deadlocked, there are several mutexes that seem stuck:

  1. Creating new threads:
    +                       2204 ThreadNative_Start  (in libcoreclr.dylib) + 42  [0x1092ceb6a]
    +                         2204 ThreadNative::Start(Thread*, int, int, char16_t const*)  (in libcoreclr.dylib) + 208  [0x1092ced30]
    +                           2204 Thread::CreateNewThread(unsigned long, unsigned int (*)(void*), void*, char16_t const*)  (in libcoreclr.dylib) + 28  [0x10928657c]
    +                             2204 Thread::CreateNewOSThread(unsigned long, unsigned int (*)(void*), void*)  (in libcoreclr.dylib) + 140  [0x10928663c]
    +                               2204 PAL_CreateThread64  (in libcoreclr.dylib) + 105  [0x109197d29]
    +                                 2204 CorUnix::InternalCreateThread(CorUnix::CPalThread*, _SECURITY_ATTRIBUTES*, unsigned int, unsigned int (*)(void*), void*, unsigned int, CorUnix::PalThreadType, unsigned long*, void**)  (in libcoreclr.dylib) + 509  [0x109197bed]
    +                                   2204 CorUnix::InternalEnterCriticalSection(CorUnix::CPalThread*, _CRITICAL_SECTION*)  (in libcoreclr.dylib) + 349  [0x10918686d]
    +                                     2204 _pthread_cond_wait  (in libsystem_pthread.dylib) + 1242  [0x7ff80ef67758]
    +                                       2204 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  [0x7ff80ef2b0de]
    +                                         2204 ???  (in <unknown binary>)  [0x7ff89eb7a9a8]
  1. Memory allocation:
    +                           2204 operator new(unsigned long)  (in libcoreclr.dylib) + 9  [0x1094614a9]
    +                             2204 szone_malloc_should_clear  (in libsystem_malloc.dylib) + 109  [0x7ff80ed98035]
    +                               2204 small_malloc_should_clear  (in libsystem_malloc.dylib) + 147  [0x7ff80ed981bd]
    +                                 2204 _os_unfair_lock_lock_slow  (in libsystem_platform.dylib) + 162  [0x7ff80ef92b07]
    +                                   2204 __ulock_wait  (in libsystem_kernel.dylib) + 10  [0x7ff80ef29d1e]
    +                                     2204 ???  (in <unknown binary>)  [0x7ff89eb7a9a8]
  1. Creating new processes:
    +                       2204 SystemNative_ForkAndExecProcess  (in libSystem.Native.dylib) + 174  [0x108f858de]
    +                         2204 _pthread_mutex_firstfit_lock_slow  (in libsystem_pthread.dylib) + 214  [0x7ff80ef62921]
    +                           2204 _pthread_mutex_firstfit_lock_wait  (in libsystem_pthread.dylib) + 76  [0x7ff80ef64b14]
    +                             2204 __psynch_mutexwait  (in libsystem_kernel.dylib) + 10  [0x7ff80ef2a872]

although the latter might be due to fork not returning in a timely manner?

    +                       2204 SystemNative_ForkAndExecProcess  (in libSystem.Native.dylib) + 705  [0x108f85af1]
    +                         2204 __fork  (in libsystem_kernel.dylib) + 11  [0x7ff80ef2e5cf]
    +                           2204 ???  (in <unknown binary>)  [0x7ff89eb7a01c]
    +                             2204 ???  (in runtime)  load address 0x7ff7ffec5000 + 0x145d8  [0x7ff7ffed95d8]
    +                               2204 ???  (in runtime)  load address 0x7ff7ffec5000 + 0x13fe4  [0x7ff7ffed8fe4]
    +                                 2204 ???  (in runtime)  load address 0x7ff7ffec5000 + 0x44fc  [0x7ff7ffec94fc]

Smaller sample text file with redundant frames removed:
Sample.of.dotnet-simplified.txt

@rolfbjarne
Copy link
Member

On a different note, I noticed it's the x64 version of dotnet that's being used:

Code Type: X86-64 (translated)

I'm not sure if it's possible to change this in your CI to use the arm64 version instead to see if that fixes the issue?

@svaldetero
Copy link
Author

svaldetero commented Dec 6, 2023

Here is my dotnet info on the machine:

envocadmin@HM-MACSTUDIO ~ % dotnet --info
.NET SDK:
 Version:           8.0.100
 Commit:            57efcf1350
 Workload version:  8.0.100-manifests.71b9f198

Runtime Environment:
 OS Name:     Mac OS X
 OS Version:  13.6
 OS Platform: Darwin
 RID:         osx-arm64
 Base Path:   /usr/local/share/dotnet/sdk/8.0.100/

.NET workloads installed:
 Workload version: 8.0.100-manifests.71b9f198
 [ios]
   Installation Source: SDK 8.0.100
   Manifest Version:    17.0.8478/8.0.100
   Manifest Path:       /usr/local/share/dotnet/sdk-manifests/8.0.100/microsoft.net.sdk.ios/17.0.8478/WorkloadManifest.json
   Install Type:        FileBased

 [maui]
   Installation Source: SDK 8.0.100
   Manifest Version:    8.0.3/8.0.100
   Manifest Path:       /usr/local/share/dotnet/sdk-manifests/8.0.100/microsoft.net.sdk.maui/8.0.3/WorkloadManifest.json
   Install Type:        FileBased

 [android]
   Installation Source: SDK 8.0.100
   Manifest Version:    34.0.43/8.0.100
   Manifest Path:       /usr/local/share/dotnet/sdk-manifests/8.0.100/microsoft.net.sdk.android/34.0.43/WorkloadManifest.json
   Install Type:        FileBased


Host:
  Version:      8.0.0
  Architecture: arm64
  Commit:       5535e31a71

.NET SDKs installed:
  6.0.410 [/usr/local/share/dotnet/sdk]
  6.0.417 [/usr/local/share/dotnet/sdk]
  7.0.304 [/usr/local/share/dotnet/sdk]
  7.0.311 [/usr/local/share/dotnet/sdk]
  8.0.100 [/usr/local/share/dotnet/sdk]

.NET runtimes installed:
  Microsoft.AspNetCore.App 6.0.18 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 6.0.25 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 7.0.7 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 7.0.14 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 8.0.0 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 6.0.18 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]
  Microsoft.NETCore.App 6.0.25 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]
  Microsoft.NETCore.App 7.0.7 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]
  Microsoft.NETCore.App 7.0.14 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]
  Microsoft.NETCore.App 8.0.0 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]

Other architectures found:
  None

Environment variables:
  Not set

global.json file:
  Not found

Learn more:
  https://aka.ms/dotnet/info

Download .NET:
  https://aka.ms/dotnet/download

Is there any command that I can run to see which architecture is installed?

Part of the tasks in the pipeline are to install dotnet and maui workloads. Here's the task:

- task: UseDotNet@2
  displayName: 'Setup .NET' 
  inputs:
    packageType: 'sdk'
    version: '8.0.x'
    includePreviewVersions: true
    installationPath: $(Agent.ToolsDirectory)/dotnet

From what I can see, the Agent has folders _work/_tool/dotnet/sdk/8.0.100. I assume it is using that sdk and not the one globally installed on the machine? How can I force it to install the arm64 version?

@svaldetero
Copy link
Author

Well it looks like this is an existing issue: microsoft/azure-pipelines-tasks#19174.

If this really is the cause, it would explain why doing a build locally on the machine works but not in the agent.

@svaldetero
Copy link
Author

svaldetero commented Dec 6, 2023

Running these commands in the pipeline, returned the following results:

    uname -> Darwin
    uname -s -> Darwin
    uname -m -> x86_64
    uname -p -> i386
    arch -> i386
    sysctl -n machdep.cpu.brand_string -> Apple M2 Max

@rolfbjarne
Copy link
Member

From what I can see, the Agent has folders _work/_tool/dotnet/sdk/8.0.100. I assume it is using that sdk and not the one globally installed on the machine? How can I force it to install the arm64 version?

Maybe it's possible to force it to use the system .NET by setting the PATH environment variable before running dotnet?

PATH=/usr/local/share/dotnet:$PATH

@svaldetero
Copy link
Author

I ran multiple runs alternating commenting out the install dotnet task, and the ones that were commented out all ran successfully. I assume if you don't specify a version it uses what's installed on the machine (which I manually installed as arm64).

It looks like my workaround is to not install dotnet on self-hosted builds. I'm not sure if fixing running the wrong architecture of dotnet on a M2 causes the ios build to freeze is a priority or not.

Thank you for your help in figuring out what the issue is. I'll let you decide if you want to close this or not.

@rolfbjarne
Copy link
Member

The deadlock in dotnet could be a bug in dotnet (or maybe even Rosetta), so I'm moving this to dotnet/runtime.

@rolfbjarne
Copy link
Member

This issue was moved to dotnet/runtime#95704

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug If an issue is a bug or a pull request a bug fix
Projects
None yet
Development

No branches or pull requests

2 participants