Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add workaround for ASP.NET ConfigBuilder issue #6147

Merged
merged 3 commits into from
Oct 31, 2024

Conversation

andrewlock
Copy link
Member

@andrewlock andrewlock commented Oct 11, 2024

Summary of changes

Adds a check that IisPreStartInit has completed before we run any automatic instrumentations

Reason for change

We recently had a case where a customer was using AzureKeyVaultConfigBuilder with ASP.NET's web.config to load configuration into AppSettings, which was causing the application to deadlock on startup.

After some investigation, we isolated the problem as happening specifically when there are 2 apps running inside an app pool:

  • App 1 starts
    • The config builder populates and loads from key vault
    • We initialize the tracer, and set up instrumentation
    • App1 works fine 👍
  • App 2 starts
    • The config builder tries to populate and load from key vault
      • This requires making HTTP requests
      • Due to the instrumentation run for app 1, we instrument the HTTP request
      • This invokes CallTargetInvoker which tries to initialize the tracer
      • Initializing the tracer requires reading AppSettings so that we can populate configuration
        • The config builder tries to populate and load from key vault
        • ♻️re-entry 💥

boom

So the key thing is that we need to make sure we don't run our automatic instrumentation until after the IIS pre-init stage is completely, to avoid re-entry and recursion during setup.

Implementation details

The implementation is leveraging work we did years ago to fix essentially the same problem: #1157. The problem back then was with Liblog and NLog, but we did all the work to inject flags for tracking when it is safe to make changes.

Given the native work already exists, we can piggy-pack on those hooks and make sure we don't run any automatic instrumentations unless the domain as finished pre-initialization.

This is still relatively fragile, as things like adding a static reference to IDatadogLogger would cause a static initialization to happen too early, and ultimately deadlock/crash the app, so added a bunch of comments to try to highlight the issue

Test coverage

This is kind of a pain to test because it requires a custom config builder, plus two applications running in an app pool. I tested manually by

  • Created a custom ConfigBuilder based on the AzureKeyVaultConfigBuilder implementation. The custom builder simply makes a generic HTTP request when a value is requested
  • Create a generic .NET 472 asp.net app
  • Add to the web.config to set up the builder (see below)
  • Publish the app
  • Create two asp.net apps in IIS, using the same app pool
  • Hit the first app - all good 👍
  • Hit the second app - 💥 The configBuilder 'CustomBuilder' failed while processing the configuration section 'appSettings'.: The ConfigurationBuilder 'CustomBuilder[Microsoft.Configuration.ConfigurationBuilders.CustomConfigBuilder]' has recursively re-entered processing of the 'appSettings' section (I based my implementation on v3 which specifically detects re-entry)
  • Made the fix, now it works 🎉

web.config for my dummy test:

<configuration>
  <configSections>
    <section name="configBuilders"
      type="System.Configuration.ConfigurationBuildersSection, 
      System.Configuration, Version=4.0.0.0, Culture=neutral, 
      PublicKeyToken=b03f5f7f11d50a3a"
      restartOnExternalChanges="false" requirePermission="false" />
  </configSections>

  <configBuilders>
    <builders>
      <add name="CustomBuilder" preloadSecretNames="false" Uri="https://some-value-prd.vault.azure.net/"
      type="Microsoft.Configuration.ConfigurationBuilders.CustomConfigBuilder, CustomConfigBuilder, Version=1.0.0.0, Culture=neutral" />
    </builders>
  </configBuilders>

  <appSettings configBuilders="CustomBuilder">
    <add key="DummyKey1" value="DummyKey1 value from web.config" />
    <add key="DummyKey2" value="DummyKey2 value from web.config" />
  </appSettings>

Complete test solution is here:
ConfigBuilderIssueRepro.zip

Other details

Fixes: https://datadoghq.atlassian.net/browse/APMS-13426

@andrewlock andrewlock added type:bug area:automatic-instrumentation Automatic instrumentation managed C# code (Datadog.Trace.ClrProfiler.Managed) labels Oct 11, 2024
@andrewlock andrewlock requested a review from a team as a code owner October 11, 2024 14:30
@datadog-ddstaging
Copy link

datadog-ddstaging bot commented Oct 11, 2024

Datadog Report

Branch report: andrew/fix-recurrsion-in-aspnet
Commit report: 8b4c8b6
Test service: dd-trace-dotnet

✅ 0 Failed, 363283 Passed, 2064 Skipped, 15h 8m 50.56s Total Time

@andrewlock
Copy link
Member Author

andrewlock commented Oct 11, 2024

Execution-Time Benchmarks Report ⏱️

Execution-time results for samples comparing the following branches/commits:

Execution-time benchmarks measure the whole time it takes to execute a program. And are intended to measure the one-off costs. Cases where the execution time results for the PR are worse than latest master results are shown in red. The following thresholds were used for comparing the execution times:

  • Welch test with statistical test for significance of 5%
  • Only results indicating a difference greater than 5% and 5 ms are considered.

Note that these results are based on a single point-in-time result for each branch. For full results, see the dashboard.

Graphs show the p99 interval based on the mean and StdDev of the test run, as well as the mean value of the run (shown as a diamond below the graph).

gantt
    title Execution time (ms) FakeDbCommand (.NET Framework 4.6.2) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Baseline
    This PR (6147) - mean (71ms)  : 68, 73
     .   : milestone, 71,
    master - mean (70ms)  : 68, 72
     .   : milestone, 70,

    section CallTarget+Inlining+NGEN
    This PR (6147) - mean (1,114ms)  : 1091, 1138
     .   : milestone, 1114,
    master - mean (1,111ms)  : 1094, 1128
     .   : milestone, 1111,

Loading
gantt
    title Execution time (ms) FakeDbCommand (.NET Core 3.1) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Baseline
    This PR (6147) - mean (110ms)  : 107, 113
     .   : milestone, 110,
    master - mean (109ms)  : 106, 112
     .   : milestone, 109,

    section CallTarget+Inlining+NGEN
    This PR (6147) - mean (773ms)  : 761, 786
     .   : milestone, 773,
    master - mean (768ms)  : 751, 784
     .   : milestone, 768,

Loading
gantt
    title Execution time (ms) FakeDbCommand (.NET 6) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Baseline
    This PR (6147) - mean (93ms)  : 89, 97
     .   : milestone, 93,
    master - mean (93ms)  : 89, 96
     .   : milestone, 93,

    section CallTarget+Inlining+NGEN
    This PR (6147) - mean (727ms)  : 708, 747
     .   : milestone, 727,
    master - mean (730ms)  : 711, 750
     .   : milestone, 730,

Loading
gantt
    title Execution time (ms) HttpMessageHandler (.NET Framework 4.6.2) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Baseline
    This PR (6147) - mean (190ms)  : 187, 192
     .   : milestone, 190,
    master - mean (191ms)  : 186, 195
     .   : milestone, 191,

    section CallTarget+Inlining+NGEN
    This PR (6147) - mean (1,202ms)  : 1178, 1227
     .   : milestone, 1202,
    master - mean (1,204ms)  : 1176, 1232
     .   : milestone, 1204,

Loading
gantt
    title Execution time (ms) HttpMessageHandler (.NET Core 3.1) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Baseline
    This PR (6147) - mean (276ms)  : 272, 280
     .   : milestone, 276,
    master - mean (275ms)  : 271, 280
     .   : milestone, 275,

    section CallTarget+Inlining+NGEN
    This PR (6147) - mean (940ms)  : 923, 958
     .   : milestone, 940,
    master - mean (950ms)  : 935, 966
     .   : milestone, 950,

Loading
gantt
    title Execution time (ms) HttpMessageHandler (.NET 6) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Baseline
    This PR (6147) - mean (264ms)  : 260, 268
     .   : milestone, 264,
    master - mean (265ms)  : 261, 268
     .   : milestone, 265,

    section CallTarget+Inlining+NGEN
    This PR (6147) - mean (922ms)  : 903, 940
     .   : milestone, 922,
    master - mean (928ms)  : 906, 950
     .   : milestone, 928,

Loading

@andrewlock
Copy link
Member Author

andrewlock commented Oct 11, 2024

Throughput/Crank Report ⚡

Throughput results for AspNetCoreSimpleController comparing the following branches/commits:

Cases where throughput results for the PR are worse than latest master (5% drop or greater), results are shown in red.

Note that these results are based on a single point-in-time result for each branch. For full results, see one of the many, many dashboards!

gantt
    title Throughput Linux x64 (Total requests) 
    dateFormat  X
    axisFormat %s
    section Baseline
    This PR (6147) (11.119M)   : 0, 11119397
    master (11.119M)   : 0, 11118546
    benchmarks/2.9.0 (11.081M)   : 0, 11080577

    section Automatic
    This PR (6147) (7.426M)   : 0, 7425613
    master (7.321M)   : 0, 7321152
    benchmarks/2.9.0 (7.732M)   : 0, 7732233

    section Trace stats
    master (7.757M)   : 0, 7756918

    section Manual
    master (11.210M)   : 0, 11209846

    section Manual + Automatic
    This PR (6147) (6.821M)   : 0, 6821227
    master (6.740M)   : 0, 6740459

    section DD_TRACE_ENABLED=0
    master (10.109M)   : 0, 10109349

Loading
gantt
    title Throughput Linux arm64 (Total requests) 
    dateFormat  X
    axisFormat %s
    section Baseline
    This PR (6147) (9.609M)   : 0, 9608625
    master (9.561M)   : 0, 9560673
    benchmarks/2.9.0 (9.798M)   : 0, 9798067

    section Automatic
    This PR (6147) (6.520M)   : 0, 6520075
    master (6.578M)   : 0, 6578194

    section Trace stats
    master (6.868M)   : 0, 6867545

    section Manual
    master (9.568M)   : 0, 9568124

    section Manual + Automatic
    This PR (6147) (6.032M)   : 0, 6031910
    master (6.203M)   : 0, 6203435

    section DD_TRACE_ENABLED=0
    master (8.858M)   : 0, 8857806

Loading
gantt
    title Throughput Windows x64 (Total requests) 
    dateFormat  X
    axisFormat %s
    section Baseline
    This PR (6147) (10.313M)   : 0, 10313173
    master (9.889M)   : 0, 9889232
    benchmarks/2.9.0 (10.067M)   : 0, 10067315

    section Automatic
    This PR (6147) (6.686M)   : 0, 6685782
    master (6.396M)   : 0, 6396160
    benchmarks/2.9.0 (7.552M)   : 0, 7552193

    section Trace stats
    master (7.205M)   : 0, 7205159

    section Manual
    master (9.853M)   : 0, 9852595

    section Manual + Automatic
    This PR (6147) (6.263M)   : 0, 6262824
    master (5.939M)   : 0, 5939480

    section DD_TRACE_ENABLED=0
    master (9.222M)   : 0, 9221606

Loading

@andrewlock
Copy link
Member Author

andrewlock commented Oct 11, 2024

Benchmarks Report for tracer 🐌

Benchmarks for #6147 compared to master:

  • 1 benchmarks are faster, with geometric mean 1.148
  • 1 benchmarks have fewer allocations
  • 1 benchmarks have more allocations

The following thresholds were used for comparing the benchmark speeds:

  • Mann–Whitney U test with statistical test for significance of 5%
  • Only results indicating a difference greater than 10% and 0.3 ns are considered.

Allocation changes below 0.5% are ignored.

Benchmark details

Benchmarks.Trace.ActivityBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master StartStopWithChild net6.0 7.8μs 43.4ns 310ns 0.0152 0.00758 0 5.43 KB
master StartStopWithChild netcoreapp3.1 9.86μs 54.5ns 318ns 0.0242 0.00968 0 5.61 KB
master StartStopWithChild net472 16.1μs 59ns 228ns 1.02 0.311 0.0957 6.06 KB
#6147 StartStopWithChild net6.0 7.62μs 42.8ns 290ns 0.0214 0.0129 0.00429 5.42 KB
#6147 StartStopWithChild netcoreapp3.1 9.96μs 54.9ns 329ns 0.0147 0.00489 0 5.62 KB
#6147 StartStopWithChild net472 16μs 37.8ns 146ns 1.02 0.294 0.0955 6.07 KB
Benchmarks.Trace.AgentWriterBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master WriteAndFlushEnrichedTraces net6.0 478μs 127ns 459ns 0 0 0 2.7 KB
master WriteAndFlushEnrichedTraces netcoreapp3.1 653μs 493ns 1.91μs 0 0 0 2.7 KB
master WriteAndFlushEnrichedTraces net472 837μs 387ns 1.45μs 0.422 0 0 3.3 KB
#6147 WriteAndFlushEnrichedTraces net6.0 473μs 138ns 517ns 0 0 0 2.7 KB
#6147 WriteAndFlushEnrichedTraces netcoreapp3.1 653μs 522ns 2.02μs 0 0 0 2.7 KB
#6147 WriteAndFlushEnrichedTraces net472 837μs 603ns 2.34μs 0.419 0 0 3.3 KB
Benchmarks.Trace.AspNetCoreBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master SendRequest net6.0 206μs 1.55μs 15.5μs 0.212 0 0 18.45 KB
master SendRequest netcoreapp3.1 231μs 1.35μs 12.6μs 0.216 0 0 20.61 KB
master SendRequest net472 0.000903ns 0.000508ns 0.00197ns 0 0 0 0 b
#6147 SendRequest net6.0 202μs 1.19μs 11.3μs 0.21 0 0 18.45 KB
#6147 SendRequest netcoreapp3.1 239μs 1.52μs 15.1μs 0.232 0 0 20.61 KB
#6147 SendRequest net472 0.000473ns 0.000331ns 0.00119ns 0 0 0 0 b
Benchmarks.Trace.CIVisibilityProtocolWriterBenchmark - Same speed ✔️ More allocations ⚠️

More allocations ⚠️ in #6147

Benchmark Base Allocated Diff Allocated Change Change %
Benchmarks.Trace.CIVisibilityProtocolWriterBenchmark.WriteAndFlushEnrichedTraces‑netcoreapp3.1 41.49 KB 41.71 KB 221 B 0.53%

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master WriteAndFlushEnrichedTraces net6.0 598μs 3.4μs 23.6μs 0.563 0 0 41.7 KB
master WriteAndFlushEnrichedTraces netcoreapp3.1 767μs 4.24μs 26.2μs 0.383 0 0 41.49 KB
master WriteAndFlushEnrichedTraces net472 875μs 2.76μs 10.3μs 8.33 2.31 0.463 53.34 KB
#6147 WriteAndFlushEnrichedTraces net6.0 606μs 3.33μs 20.3μs 0.568 0 0 41.71 KB
#6147 WriteAndFlushEnrichedTraces netcoreapp3.1 706μs 3.74μs 21.8μs 0.355 0 0 41.71 KB
#6147 WriteAndFlushEnrichedTraces net472 872μs 4.15μs 17.6μs 8.45 2.53 0.422 53.36 KB
Benchmarks.Trace.DbCommandBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master ExecuteNonQuery net6.0 1.36μs 0.76ns 2.84ns 0.0143 0 0 1.02 KB
master ExecuteNonQuery netcoreapp3.1 1.79μs 1.54ns 5.76ns 0.0135 0 0 1.02 KB
master ExecuteNonQuery net472 2.12μs 1.71ns 6.61ns 0.157 0 0 987 B
#6147 ExecuteNonQuery net6.0 1.36μs 0.817ns 2.95ns 0.0143 0 0 1.02 KB
#6147 ExecuteNonQuery netcoreapp3.1 1.73μs 1.62ns 6.08ns 0.0138 0 0 1.02 KB
#6147 ExecuteNonQuery net472 2.07μs 0.806ns 2.91ns 0.156 0.00104 0 987 B
Benchmarks.Trace.ElasticsearchBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master CallElasticsearch net6.0 1.19μs 2.91ns 11.3ns 0.0135 0 0 976 B
master CallElasticsearch netcoreapp3.1 1.59μs 0.604ns 2.26ns 0.0128 0 0 976 B
master CallElasticsearch net472 2.59μs 1.14ns 4.4ns 0.157 0 0 995 B
master CallElasticsearchAsync net6.0 1.3μs 0.557ns 2.16ns 0.0136 0 0 952 B
master CallElasticsearchAsync netcoreapp3.1 1.66μs 0.745ns 2.79ns 0.0141 0 0 1.02 KB
master CallElasticsearchAsync net472 2.47μs 1.53ns 5.94ns 0.167 0 0 1.05 KB
#6147 CallElasticsearch net6.0 1.3μs 1.64ns 6.13ns 0.0133 0 0 976 B
#6147 CallElasticsearch netcoreapp3.1 1.55μs 0.625ns 2.34ns 0.0132 0 0 976 B
#6147 CallElasticsearch net472 2.62μs 2.17ns 8.39ns 0.158 0 0 995 B
#6147 CallElasticsearchAsync net6.0 1.29μs 0.556ns 2.08ns 0.0136 0 0 952 B
#6147 CallElasticsearchAsync netcoreapp3.1 1.57μs 0.701ns 2.72ns 0.014 0 0 1.02 KB
#6147 CallElasticsearchAsync net472 2.52μs 1.39ns 5.4ns 0.166 0 0 1.05 KB
Benchmarks.Trace.GraphQLBenchmark - Faster 🎉 Same allocations ✔️

Faster 🎉 in #6147

Benchmark base/diff Base Median (ns) Diff Median (ns) Modality
Benchmarks.Trace.GraphQLBenchmark.ExecuteAsync‑net6.0 1.148 1,379.44 1,202.07

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master ExecuteAsync net6.0 1.38μs 1.03ns 3.99ns 0.0131 0 0 952 B
master ExecuteAsync netcoreapp3.1 1.67μs 1.8ns 6.74ns 0.0123 0 0 952 B
master ExecuteAsync net472 1.78μs 0.713ns 2.76ns 0.145 0 0 915 B
#6147 ExecuteAsync net6.0 1.2μs 0.525ns 1.96ns 0.0132 0 0 952 B
#6147 ExecuteAsync netcoreapp3.1 1.69μs 0.847ns 3.17ns 0.0129 0 0 952 B
#6147 ExecuteAsync net472 1.78μs 0.595ns 2.3ns 0.145 0 0 915 B
Benchmarks.Trace.HttpClientBenchmark - Same speed ✔️ Fewer allocations 🎉

Fewer allocations 🎉 in #6147

Benchmark Base Allocated Diff Allocated Change Change %
Benchmarks.Trace.HttpClientBenchmark.SendAsync‑net472 3.15 KB 3.07 KB -73 B -2.32%

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master SendAsync net6.0 4.18μs 2.36ns 9.13ns 0.031 0 0 2.22 KB
master SendAsync netcoreapp3.1 5.01μs 1.48ns 5.54ns 0.0378 0 0 2.76 KB
master SendAsync net472 7.91μs 1.26ns 4.72ns 0.498 0 0 3.15 KB
#6147 SendAsync net6.0 4.31μs 6.64ns 24.8ns 0.0315 0 0 2.22 KB
#6147 SendAsync netcoreapp3.1 5.03μs 2.1ns 8.15ns 0.0378 0 0 2.76 KB
#6147 SendAsync net472 7.37μs 1.18ns 4.58ns 0.488 0 0 3.07 KB
Benchmarks.Trace.ILoggerBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master EnrichedLog net6.0 1.51μs 0.841ns 3.03ns 0.0228 0 0 1.64 KB
master EnrichedLog netcoreapp3.1 2.2μs 1.81ns 6.75ns 0.0222 0 0 1.64 KB
master EnrichedLog net472 2.68μs 1.03ns 3.98ns 0.249 0 0 1.57 KB
#6147 EnrichedLog net6.0 1.44μs 0.751ns 2.81ns 0.023 0 0 1.64 KB
#6147 EnrichedLog netcoreapp3.1 2.26μs 1.06ns 3.98ns 0.0215 0 0 1.64 KB
#6147 EnrichedLog net472 2.55μs 1.26ns 4.72ns 0.249 0 0 1.57 KB
Benchmarks.Trace.Log4netBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master EnrichedLog net6.0 117μs 221ns 856ns 0 0 0 4.28 KB
master EnrichedLog netcoreapp3.1 121μs 273ns 1.02μs 0.0597 0 0 4.28 KB
master EnrichedLog net472 150μs 188ns 704ns 0.673 0.224 0 4.46 KB
#6147 EnrichedLog net6.0 115μs 107ns 401ns 0.0586 0 0 4.28 KB
#6147 EnrichedLog netcoreapp3.1 122μs 133ns 516ns 0 0 0 4.28 KB
#6147 EnrichedLog net472 152μs 136ns 528ns 0.685 0.228 0 4.46 KB
Benchmarks.Trace.NLogBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master EnrichedLog net6.0 3.08μs 1.03ns 3.98ns 0.0313 0 0 2.2 KB
master EnrichedLog netcoreapp3.1 4.23μs 1.69ns 6.56ns 0.0297 0 0 2.2 KB
master EnrichedLog net472 4.7μs 1.57ns 6.09ns 0.32 0 0 2.02 KB
#6147 EnrichedLog net6.0 3.14μs 1.09ns 4.21ns 0.0297 0 0 2.2 KB
#6147 EnrichedLog netcoreapp3.1 4.26μs 1.4ns 5.42ns 0.0296 0 0 2.2 KB
#6147 EnrichedLog net472 5.01μs 1.2ns 4.49ns 0.32 0 0 2.02 KB
Benchmarks.Trace.RedisBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master SendReceive net6.0 1.4μs 0.902ns 3.49ns 0.0161 0 0 1.14 KB
master SendReceive netcoreapp3.1 1.77μs 2.36ns 9.13ns 0.015 0 0 1.14 KB
master SendReceive net472 2.15μs 0.688ns 2.48ns 0.184 0.00107 0 1.16 KB
#6147 SendReceive net6.0 1.37μs 0.694ns 2.69ns 0.0156 0 0 1.14 KB
#6147 SendReceive netcoreapp3.1 1.77μs 0.55ns 2.13ns 0.0149 0 0 1.14 KB
#6147 SendReceive net472 2.15μs 0.939ns 3.64ns 0.183 0 0 1.16 KB
Benchmarks.Trace.SerilogBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master EnrichedLog net6.0 2.55μs 0.732ns 2.83ns 0.0226 0 0 1.6 KB
master EnrichedLog netcoreapp3.1 3.94μs 1.57ns 5.87ns 0.022 0 0 1.65 KB
master EnrichedLog net472 4.36μs 1.31ns 4.89ns 0.322 0 0 2.04 KB
#6147 EnrichedLog net6.0 2.84μs 1.23ns 4.62ns 0.0226 0 0 1.6 KB
#6147 EnrichedLog netcoreapp3.1 3.82μs 2.55ns 9.89ns 0.021 0 0 1.65 KB
#6147 EnrichedLog net472 4.49μs 2.58ns 9.67ns 0.324 0 0 2.04 KB
Benchmarks.Trace.SpanBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master StartFinishSpan net6.0 400ns 0.414ns 1.6ns 0.00801 0 0 576 B
master StartFinishSpan netcoreapp3.1 586ns 0.379ns 1.47ns 0.00756 0 0 576 B
master StartFinishSpan net472 720ns 0.739ns 2.86ns 0.0917 0 0 578 B
master StartFinishScope net6.0 486ns 0.378ns 1.46ns 0.00984 0 0 696 B
master StartFinishScope netcoreapp3.1 762ns 4.05ns 19.8ns 0.00946 0 0 696 B
master StartFinishScope net472 859ns 0.956ns 3.7ns 0.104 0 0 658 B
#6147 StartFinishSpan net6.0 416ns 0.311ns 1.2ns 0.00809 0 0 576 B
#6147 StartFinishSpan netcoreapp3.1 619ns 0.726ns 2.81ns 0.00768 0 0 576 B
#6147 StartFinishSpan net472 725ns 0.761ns 2.95ns 0.0918 0 0 578 B
#6147 StartFinishScope net6.0 493ns 0.749ns 2.9ns 0.00982 0 0 696 B
#6147 StartFinishScope netcoreapp3.1 791ns 0.576ns 2.23ns 0.0091 0 0 696 B
#6147 StartFinishScope net472 857ns 0.633ns 2.37ns 0.104 0 0 658 B
Benchmarks.Trace.TraceAnnotationsBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master RunOnMethodBegin net6.0 659ns 1.67ns 6.47ns 0.00979 0 0 696 B
master RunOnMethodBegin netcoreapp3.1 942ns 1.28ns 4.44ns 0.00906 0 0 696 B
master RunOnMethodBegin net472 1.08μs 0.95ns 3.68ns 0.104 0 0 658 B
#6147 RunOnMethodBegin net6.0 660ns 0.409ns 1.58ns 0.00997 0 0 696 B
#6147 RunOnMethodBegin netcoreapp3.1 916ns 0.575ns 2.23ns 0.00931 0 0 696 B
#6147 RunOnMethodBegin net472 1.1μs 0.651ns 2.52ns 0.104 0 0 658 B

@andrewlock andrewlock force-pushed the andrew/fix-recurrsion-in-aspnet branch from 06a7a15 to 69ab96d Compare October 15, 2024 16:41
Copy link
Collaborator

@zacharycmontoya zacharycmontoya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just one question on a code comment for future readers. Great fix!

@andrewlock andrewlock added the status:do-not-merge Work is done. Can review, but do not merge yet. label Oct 16, 2024
andrewlock added a commit that referenced this pull request Oct 21, 2024
## Summary of changes

- Adds the process ID to the managed logs filename
- ~Enables buffering in .NET Core~ Reverted this change, as not required
and caused test failures. Should be invetigated later

## Reason for change

While investigating #6147, we discovered that if you have two app pools,
running under different users, the second app domain is unable to write
managed logs. This is because it's unable to hold the shared mutex
that's used to synchronize access to the file.

## Implementation details

Include the pID in the filename, so each process writes to a separate
file.

<details><summary>I originally enabled buffering, but removed
it</summary>
<p>
As a consequence, if we only have one tracer instance writing to a
managed file, then we no longer need to have a shared logger, and can
enable buffering, which should improve performance by buffering writes
to the file.
</p>
<ul>
<li>We can't assume this in .NET Framework, as we can have multiple app
domains writing to the same file, so this is only enabled for .NET
Core.</li>
<li>_Theoretically_ this is a problem in version conflict, where we
could have a v2 tracer loaded at the same time as a v3 tracer. However,
as this PR changes the file name to differ from v2, that's not an issue,
so we should still be safe</li>
<li>For the buffering, we need to choose a flush interval. I plucked 1s
out of the sky - any other suggestions?</li>

</details> 

However, I reverted this change as it broke the dynamic instrumentation
tests, because they ended up reading partial lines. We _should_ fix this
so that we can get the perf benefit of buffering etc, but it's
tangential to the bug fix, so removed for now.



## Test coverage

Yeah... the existing integration tests obviously write a bunch of
logs... any suggestion for additional tests we should add would be
gratefully received...

## Other details

I also updated the file-rolling behaviour. Having daily file rolling _in
addition_ to split by pid _and_ file size limit seemed more confusing
than anything else.

- For customers that have multi-day-long processes with high log volume,
the results should be broadly the same, as they're limited by rolling
file size
- For customers that have multi-day-long processes with low log volume,
log volume overall may increase, as they won't roll _until_ they hit the
log limit? I think, this one I'm struggling to visualize
- For customers with short-lived processes, they will have more _files_
but these will use the same amount of space.
@andrewlock andrewlock requested review from a team as code owners October 22, 2024 12:38
Copy link
Contributor

@bouwkast bouwkast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow thanks a lot for all of the detailed descriptions here!

Wondering if we should somehow add this reproduction app to the repo? Can look at that later though.

@andrewlock
Copy link
Member Author

Wondering if we should somehow add this reproduction app to the repo? Can look at that later though.

Absolutely @bouwkast, I have started looking at this, but it's a bit of a pain, and didn't want to block the fix going in given it's manually tested 😅 I will work on adding the repro + test subsequently ASAP though

andrewlock added a commit that referenced this pull request Oct 29, 2024
## Summary of changes

Fixes the "LoaderOptimization" IIS tests

## Reason for change

While working on a separate issue, discovered that the LoaderOptimzation
tests (in the `msi_integration_tests` stage aren't testing anything
because the MSI isn't _actually_ being installed 🤦

## Implementation details

The "fix" is 
- `dotnet_tracer_msi` => `DOTNET_TRACER_MSI` in the `docker compose
build` step
- Add the msi to the `.dockerignore` so that it's actually added to the
container 😬

Also took the chance to refactor this a little too (in preparation for a
stacked PR)

- Map the logs directory inside the container into `build_data/logs`
- Rename the generic "IIS" name in the docker file/docker-compose to
flag that it's explicitly for testing the LoaderOptimization behaviour
- Added a check to the end of the smoke tests that makes sure we have
some tracing logs. We could be more specific about this, run an agent
inside the container etc, but this is better than nothing and is simple
for now

## Test coverage

Discovered this wasn't working in a separate PR, so as long as the tests
pass (and the artifacts for the msi stage confirm we _do_ have logs,
then we should be good

## Other details

Prerequiste for testing the behaviour [in this
PR](#6147)
@andrewlock andrewlock merged commit e8cd262 into master Oct 31, 2024
62 checks passed
@andrewlock andrewlock deleted the andrew/fix-recurrsion-in-aspnet branch October 31, 2024 16:11
@github-actions github-actions bot added this to the vNext-v3 milestone Oct 31, 2024
@andrewlock andrewlock added identified-by:customer and removed status:do-not-merge Work is done. Can review, but do not merge yet. labels Oct 31, 2024
@andrewlock
Copy link
Member Author

Wondering if we should somehow add this reproduction app to the repo? Can look at that later though.

FYI @bouwkast, I added this in:

andrewlock added a commit that referenced this pull request Nov 4, 2024
## Summary of changes

Adds a smoke test for the issue fixed in #6147 

## Reason for change

The issue in #6147 is complex and subtle, with a high risk of
regression, so we want to have smoke tests to catch it`

## Implementation details

Add a sample app very similar to the one tested with in #6147. Without
the fix, [the test fails
with](https://dev.azure.com/datadoghq/dd-trace-dotnet/_build/results?buildId=166792&view=results)

```html
<b> Description: </b>An error occurred during the processing of a configuration file required to service this request. Please review the specific error details below and modify your configuration file appropriately.
<br><br>
<b> Parser Error Message: </b>The configBuilder 'CustomBuilder' failed while processing the configuration section 'appSettings'.: The ConfigurationBuilder 'CustomBuilder[Microsoft.Configuration.ConfigurationBuilders.CustomConfigBuilder]' has recursively re-entered processing of the 'appSettings' section.<br><br>
```

After the fix (now merged) the tests pass

## Test coverage

This is primarily a smoke test, so we confirm
- The sites are running and responding to HTTP requests correctly
- We instrument the application (it has logs)
- There are no errors in the logs

## Other details

For some reason, I had massive issues getting the app to "recognize" the
`DD_TRACE_AGENT_URL=http://test-agent.windows:8126` env var setting
that's required to talk to the test agent. I have no idea why, and the
only resolution I could find was to use powershell to set the variables
inside the docker image

---------

Co-authored-by: Kevin Gosse <[email protected]>
andrewlock added a commit that referenced this pull request Nov 15, 2024
…trust (#6290)

## Summary of changes

Update the `CallTargetInvoker` to explicitly bail-out if partial-trust
is detected.

## Reason for change

We don't support partial trust. In the native profiler, we have code
that explicitly bails out if we're in partial trust. Unfortunately...
IIS.

In IIS, we could have two apps running in the same app pool&mdash;one in
full trust, one in partial trust. The full trust app will be
instrumented as normal. Unfortunately, because they're running in the
same app pool, _the partial trust app is effectively also instrumented_.

In this scenario, we'll likely create errors in the partial trust app
(we have seen `MemberAccessException`s for example), and we still can't
_really_ work there.

## Implementation details

The problem is essentially the same as a recent fixed issue (#6147) -
multiple apps in the same pool causing oddities with instrumentation.
The fix in this PR piggy-backs in the work there, essentially

- We check whether the app domain is in partial trust.
- If it is, we bail out of running any CallTarget instrumentation.

This should solve the issue for calltarget and for manual
instrumentation (which is just calltarget in v3).

The issue remains for callsite instrumentation unfortunately, and likely
cannot easily be mitigated @daniel-romano-DD

## Test coverage

_sigh_ I spent all day trying to create an aspnet app that I could run
in partial trust. All day. I failed. If anyone can produce one that will
run in IIS under partial trust, then I have the setup prepared.

Unfortunately, as I can't even pass that low bar, I can't even
_manually_ test this 🙁 All I can confirm is that it doesn't affect "full
trust" applications that we use in CI 😅

## Other details

Seriously, can _anyone_ get a partial trust app running? 🤷‍♂️
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:automatic-instrumentation Automatic instrumentation managed C# code (Datadog.Trace.ClrProfiler.Managed) identified-by:customer type:bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants