Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add 1ES CI for DTFx.AS v2 #1139

Merged
merged 35 commits into from
Jul 18, 2024
Merged
Show file tree
Hide file tree
Changes from 9 commits
Commits
Show all changes
35 commits
Select commit Hold shift + click to select a range
6807f98
increase timeouts
davidmrdavid Jul 13, 2024
7bbfb20
pass 1ES CI
davidmrdavid Jul 16, 2024
5341508
resolve warning
davidmrdavid Jul 16, 2024
dd49cf6
Fix indent
davidmrdavid Jul 16, 2024
4bc59ce
test removing pkg
davidmrdavid Jul 16, 2024
b2bd2f9
re-add dep
davidmrdavid Jul 16, 2024
bc9a5ea
add TODOs
davidmrdavid Jul 17, 2024
aa77bd5
testing the removal of netcoreapp3.1 in test
davidmrdavid Jul 17, 2024
56984ce
test method rename
davidmrdavid Jul 17, 2024
eef1d03
try renaming clean up
davidmrdavid Jul 17, 2024
b45e013
try conditional compilation
davidmrdavid Jul 17, 2024
5b7b1d8
simplify test assembly paths
davidmrdavid Jul 17, 2024
aa060c2
test
davidmrdavid Jul 17, 2024
941decc
adapt test clean up and initializers to work better with CI
davidmrdavid Jul 17, 2024
1e1275d
clean out NET462
davidmrdavid Jul 17, 2024
15f44df
try mitigation
davidmrdavid Jul 17, 2024
1c0c4d2
rev deps
davidmrdavid Jul 17, 2024
b1eec32
rev logging abstractions
davidmrdavid Jul 17, 2024
345fc27
add back compilerServices.Unsafe 6.x
davidmrdavid Jul 17, 2024
e791819
remove binding redirect in app.config
davidmrdavid Jul 17, 2024
01ba1b2
keep diagnosticsource low
davidmrdavid Jul 18, 2024
896fcc5
increase diagnostics source
davidmrdavid Jul 18, 2024
5df3df2
clean up
davidmrdavid Jul 18, 2024
dd7c604
diagnostic source to 6.0.01
davidmrdavid Jul 18, 2024
f1b0f72
rev diagnostic source
davidmrdavid Jul 18, 2024
de3761a
rev Azure.Core
davidmrdavid Jul 18, 2024
4bf3039
rev logging abstractions
davidmrdavid Jul 18, 2024
2e7d199
rev logging abstractions in azstorage
davidmrdavid Jul 18, 2024
1dda1ab
decrease logging change, try again to run all tests
davidmrdavid Jul 18, 2024
be16e51
add parens
davidmrdavid Jul 18, 2024
cb72151
simplify signature
davidmrdavid Jul 18, 2024
1f56a2d
try v3
davidmrdavid Jul 18, 2024
bd915ec
remove parallelism in DT.Core
davidmrdavid Jul 18, 2024
4d4dda0
get tests to pass
davidmrdavid Jul 18, 2024
275b3dd
re-enable all partition manager tests
davidmrdavid Jul 18, 2024
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 4 additions & 3 deletions eng/ci/public-build.yml
Original file line number Diff line number Diff line change
Expand Up @@ -67,7 +67,8 @@ extends:
# Run tests
- template: /eng/templates/test.yml@self
parameters:
testAssembly: '**\bin\**\netcoreapp3.1\DurableTask.Core.Tests.dll'
# TODO: for some reason we're unable to run the net462 tests in the CI
testAssembly: '**\bin\Debug\**\DurableTask.Core.Tests.dll'
- stage: DTFxASValidate
dependsOn: []
jobs:
Expand All @@ -86,7 +87,7 @@ extends:
# Run tests
- template: /eng/templates/test.yml@self
parameters:
testAssembly: '**\bin\**\netcoreapp3.1\DurableTask.AzureStorage.Tests.dll'
testAssembly: '**\bin\Debug\**\DurableTask.AzureStorage.Tests.dll'
- stage: DTFxEmulatorValidate
dependsOn: []
jobs:
Expand All @@ -105,4 +106,4 @@ extends:
# Run tests
- template: /eng/templates/test.yml@self
parameters:
testAssembly: '**\bin\**\netcoreapp3.1\DurableTask.Emulator.Tests.dll'
testAssembly: '**\bin\Debug\**\DurableTask.Emulator.Tests.dll'
24 changes: 11 additions & 13 deletions eng/templates/test.yml
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain the changes that were made in this file?

Copy link
Collaborator Author

@davidmrdavid davidmrdavid Jul 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. So previously I had:

  - task: VSTest@2
    displayName: 'Run tests'
    inputs:
      testAssemblyVer2: |
        ${{ parameters.testAssembly }}
        !**\obj\**
        testFiltercriteria: 'TestCategory!=DisabledInCI'
        vsTestVersion: 17.0
        distributionBatchType: basedOnExecutionTime
        platform: 'any cpu'
        configuration: 'Debug'
        diagnosticsEnabled: True
        collectDumpOn: always
        rerunFailedTests: true
        rerunFailedThreshold: 30
        rerunMaxAttempts: 3

As you can see, everything below testAssemblyVer2 is at the same level indentation. That's an issue, since testFiltercriteria and the params below are not an input to testAssemblyVer2, instead they're properties that are supposed to be at the same level. So I've indented them back/out so that they're properly parsed by 1ES.

So now I have something equivalent to this (notice the indent)

  - task: VSTest@2
    displayName: 'Run tests'
    inputs:
      testAssemblyVer2: |
        ${{ parameters.testAssembly }}
        !**\obj\**
      testFiltercriteria: 'TestCategory!=DisabledInCI'
      vsTestVersion: 17.0
      distributionBatchType: basedOnExecutionTime
      platform: 'any cpu'
      configuration: 'Debug'
      diagnosticsEnabled: True
      collectDumpOn: always
      rerunFailedTests: true
      rerunFailedThreshold: 30
      rerunMaxAttempts: 3

Original file line number Diff line number Diff line change
Expand Up @@ -20,16 +20,14 @@ steps:
- task: VSTest@2
displayName: 'Run tests'
inputs:
testAssemblyVer2: |
${{ parameters.testAssembly }}
!**\obj\**
testFiltercriteria: 'TestCategory!=DisabledInCI'
vsTestVersion: 17.0
distributionBatchType: basedOnExecutionTime
platform: 'any cpu'
configuration: 'Debug'
diagnosticsEnabled: True
collectDumpOn: always
rerunFailedTests: true
rerunFailedThreshold: 30
rerunMaxAttempts: 3
testAssemblyVer2: ${{ parameters.testAssembly }}
testFiltercriteria: 'TestCategory!=DisabledInCI'
vsTestVersion: 17.0
distributionBatchType: basedOnExecutionTime
platform: 'any cpu'
configuration: 'Debug'
diagnosticsEnabled: True
collectDumpOn: always
rerunFailedTests: true
rerunFailedThreshold: 30
rerunMaxAttempts: 3
Original file line number Diff line number Diff line change
Expand Up @@ -40,7 +40,7 @@
</PropertyGroup>

<ItemGroup>
<PackageReference Include="Azure.Core" Version="1.40.0" />
<PackageReference Include="Azure.Core" Version="1.41.0" />
cgillum marked this conversation as resolved.
Show resolved Hide resolved
<PackageReference Include="Azure.Data.Tables" Version="12.8.3" />
<PackageReference Include="Azure.Storage.Blobs" Version="12.20.0" />
<PackageReference Include="Azure.Storage.Queues" Version="12.18.0" />
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -513,7 +513,7 @@ public override async IAsyncEnumerable<OrchestrationState> GetStateAsync(IEnumer
yield break;
}

IEnumerable<Task<OrchestrationState>> instanceQueries = instanceIds.Select(instance => this.GetStateAsync(instance, allExecutions: true, fetchInput: false, cancellationToken).SingleAsync().AsTask());
IEnumerable<Task<OrchestrationState>> instanceQueries = instanceIds.Select(instance => this.GetStateAsync(instance, allExecutions: true, fetchInput: false, cancellationToken).SingleOrDefaultAsync().AsTask());
cgillum marked this conversation as resolved.
Show resolved Hide resolved
foreach (OrchestrationState state in await Task.WhenAll(instanceQueries))
{
if (state != null)
Expand Down
1 change: 1 addition & 0 deletions src/DurableTask.Core/DurableTask.Core.csproj
Original file line number Diff line number Diff line change
Expand Up @@ -39,6 +39,7 @@
<PackageReference Include="Microsoft.Extensions.Logging.Abstractions" Version="2.2.0" />
<PackageReference Include="Newtonsoft.Json" Version="13.0.1" />
<PackageReference Include="System.Diagnostics.DiagnosticSource" Version="5.0.1" />
<PackageReference Include="System.Runtime.CompilerServices.Unsafe" Version="6.0.0" />
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are going v3 with DTFx, I think we should try and just upgrade some packages here:

Microsoft.Extensions.Logging.Abstractions -> 6.0.0
System.Diagnostics.DiagnosticSource -> 6.0.0

See if that fixes your problem

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, maybe that's it. I'll try to diagnostics source change first, that could be it

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No luck with just upgrading DiagnosticsSource -> https://dev.azure.com/azfunc/public/_build/results?buildId=173059&view=ms.vss-test-web.build-test-results-tab .

Let me try to rev the logging abstractions package as well next

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still no luck that failed too -> https://dev.azure.com/azfunc/public/_build/results?buildId=173069&view=ms.vss-test-web.build-test-results-tab

I'll keep the packages on the 6.x version, and also include the compilerServices.Unsafe pkg

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I increased the System.Diagnostics.DiagnosticSource pkg to 6.0.1 and that indeed fixed the issue :-) .
I did opt against increasing the logging abstractions package just because I want to avoid too many transitive dependency changes that may create issues unifying with the rest of bundles. If we have reason to think this is fine, I'm happy to add that in a follow up PR. For now though, I'd like to proceed as-is to unblock a few PRs.

<PackageReference Include="System.Reactive.Core" Version="4.4.1" />
<PackageReference Include="System.Reactive.Compatibility" Version="4.4.1" />
<PackageReference Include="Castle.Core" Version="5.0.0" />
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -2252,6 +2252,7 @@ await Task.WhenAll(
/// End-to-end test which validates a simple orchestrator function that calls an activity function
/// and checks the OpenTelemetry trace information
/// </summary>
[TestCategory("DisabledInCI")]
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these disabled tests have not been running for a long time, and they're failing in the main branch as well. They weren't running because they're conditionally compiled not to run in NET462, and our previous testing task in ADO was configured to run tests found in the net462 dll.

Now, the test task does discover these tests, but I've disabled them as they aren't passing. @bachuv can you please add this as a To-Fix before Dist. Tracing GA?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'll add this task to the list.

[DataTestMethod]
[DataRow(true)]
[DataRow(false)]
Expand Down Expand Up @@ -2344,6 +2345,7 @@ public async Task OpenTelemetry_SayHelloWithActivity(bool enableExtendedSessions
/// End-to-end test which validates a simple orchestrator function that waits for an external event
/// raised through the RaiseEvent API and checks the OpenTelemetry trace information
/// </summary>
[TestCategory("DisabledInCI")]
[DataTestMethod]
[DataRow(true)]
[DataRow(false)]
Expand Down Expand Up @@ -2445,6 +2447,7 @@ public async Task OpenTelemetry_ExternalEvent_RaiseEvent(bool enableExtendedSess
/// End-to-end test which validates a simple orchestrator function that waits for an external event
/// raised by calling SendEvent and checks the OpenTelemetry trace information
/// </summary>
[TestCategory("DisabledInCI")]
[DataTestMethod]
[DataRow(true)]
[DataRow(false)]
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -5,28 +5,31 @@
<TargetFrameworks>netcoreapp3.1;net462</TargetFrameworks>
</PropertyGroup>


<ItemGroup Condition="'$(TargetFramework)' == 'net462'">
<PackageReference Include="Microsoft.Extensions.Logging" Version="2.2.0" />
<PackageReference Include="Microsoft.WindowsAzure.ConfigurationManager" version="3.2.1" />
<PackageReference Include="WindowsAzure.Storage" version="8.7.0" />
</ItemGroup>

<ItemGroup Condition="'$(TargetFramework)' != 'net462'">
<PackageReference Include="Azure.Monitor.OpenTelemetry.Exporter" Version="1.0.0-beta.3" />
<PackageReference Include="Microsoft.Extensions.Logging" Version="3.1.6" />
<PackageReference Include="Newtonsoft.Json" Version="13.0.1" />
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

newtonsoft.json is inherited by the DTFx.Core project

<PackageReference Include="System.Configuration.ConfigurationManager" Version="4.5.0" />
<PackageReference Include="WindowsAzure.Storage" version="9.3.3" />
</ItemGroup>

<!-- Common package dependencies -->
<ItemGroup>
<PackageReference Include="Microsoft.Extensions.Logging" Version="2.2.0" />
<PackageReference Include="Microsoft.Extensions.Logging.Console" Version="2.2.0" />
Comment on lines +21 to +22
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this Logging.Console dependency because I've configured the tests to emit traces to STDOUT, which makes debugging failed ADO tests much easier. I decided to unify the dependencies to version 2.2.0 for simplicy sake and to avoid conditional compilation.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2.2.0 is quite old. Can we standardize to 6.0.0 as part of the v3 change?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had some issues getting the code to build on 1ES with more recent versions. I'm open to upgrading, but since this is the test project, I'd like to push that to be a separate PR.

<PackageReference Include="Microsoft.ApplicationInsights.DependencyCollector" Version="2.12.0" />
<PackageReference Include="Microsoft.NET.Test.Sdk" Version="15.9.0" />
<PackageReference Include="Microsoft.NET.Test.Sdk" Version="17.10.0" />
<PackageReference Include="MSTest.TestAdapter" Version="1.4.0" />
<PackageReference Include="MSTest.TestFramework" Version="1.4.0" />
<PackageReference Include="Moq" Version="4.10.0" />

<!-- We don't really make use of these dependencies, but 1ES somehow detects them in our test project and flags them for update.
Since this is just a test project, and to prevent "warning fatigue" so real CVEs stand out, we choose to upgrade the dependency explicitly to remove the false alarm.-->
<PackageReference Include="System.Data.SqlClient" Version="4.8.6" /> <!-- Version 4.3.1 was detected by default-->
<PackageReference Include="Microsoft.Data.Services.Client" Version="5.8.4" /> <!-- Without this, we resolve Microsoft.Data.OData 5.8.2, which is flagged-->
</ItemGroup>

<ItemGroup>
Expand All @@ -48,9 +51,6 @@
<None Update="large.jpeg">
<CopyToOutputDirectory>Always</CopyToOutputDirectory>
</None>
<None Update="testhost.dll.config" Condition="'$(TargetFramework)' == 'netcoreapp2.1'">
<CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
</None>
Comment on lines -51 to -53
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't have a netcoreapp2.1 TFM

</ItemGroup>

</Project>
6 changes: 4 additions & 2 deletions test/DurableTask.AzureStorage.Tests/TestHelpers.cs
Original file line number Diff line number Diff line change
Expand Up @@ -31,6 +31,8 @@ public static TestOrchestrationHost GetTestOrchestrationHost(
{
string storageConnectionString = GetTestStorageAccountConnectionString();

// TODO: update Microsoft.Extensions.Logging to avoid the following warning suppression
#pragma warning disable CS0618 // Type or member is obsolete
var settings = new AzureStorageOrchestrationServiceSettings
{
ExtendedSessionIdleTimeout = TimeSpan.FromSeconds(extendedSessionTimeoutInSeconds),
Expand All @@ -40,9 +42,9 @@ public static TestOrchestrationHost GetTestOrchestrationHost(
TaskHubName = GetTestTaskHubName(),

// Setting up a logger factory to enable the new DurableTask.Core logs
// TODO: Add a logger provider so we can collect these logs in memory.
LoggerFactory = new LoggerFactory(),
LoggerFactory = new LoggerFactory().AddConsole(LogLevel.Trace),
};
#pragma warning restore CS0618 // Type or member is obsolete
Comment on lines -43 to +47
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as mentioned, I've added logging to STDOUT on our tests to facilitate debugging. The current logging library we use is pretty old, which is why we have to disable the warning here. I tried using a newer library, but then had issues with the NET462 TFM. For speed, I decided to use the old library instead, which works.


// Give the caller a chance to make test-specific changes to the settings
modifySettingsAction?.Invoke(settings);
Expand Down
13 changes: 11 additions & 2 deletions test/DurableTask.Core.Tests/DispatcherMiddlewareTests.cs
Original file line number Diff line number Diff line change
Expand Up @@ -27,6 +27,8 @@ namespace DurableTask.Core.Tests
using DurableTask.Core.History;
using DurableTask.Emulator;
using DurableTask.Test.Orchestrations;
using Microsoft.Extensions.Logging;
using Microsoft.Extensions.Logging.Console;
using Microsoft.VisualStudio.TestTools.UnitTesting;

[TestClass]
Expand All @@ -36,10 +38,17 @@ public class DispatcherMiddlewareTests
TaskHubClient client = null!;

[TestInitialize]
public async Task Initialize()
public async Task InitializeTests()
{
// configure logging so traces are emitted during tests.
// This facilitates debugging when tests fail.

// TODO: update Microsoft.Extensions.Logging to avoid the following warning suppression
#pragma warning disable CS0618 // Type or member is obsolete
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add comments here with reminders to update these dependencies?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, just did here -> bc9a5ea

var loggerFactory = new LoggerFactory().AddConsole(LogLevel.Trace);
#pragma warning restore CS0618 // Type or member is obsolete
var service = new LocalOrchestrationService();
this.worker = new TaskHubWorker(service);
this.worker = new TaskHubWorker(service, loggerFactory);

await this.worker
.AddTaskOrchestrations(typeof(SimplestGreetingsOrchestration), typeof(ParentWorkflow), typeof(ChildWorkflow))
Expand Down
12 changes: 3 additions & 9 deletions test/DurableTask.Core.Tests/DurableTask.Core.Tests.csproj
Original file line number Diff line number Diff line change
Expand Up @@ -5,19 +5,13 @@
<TargetFrameworks>netcoreapp3.1;net462</TargetFrameworks>
</PropertyGroup>

<ItemGroup Condition="'$(TargetFramework)' == 'net462'">
<PackageReference Include="EnterpriseLibrary.SemanticLogging.EventSourceAnalyzer" Version="2.0.1406.1" />
<PackageReference Include="Microsoft.NET.Test.Sdk" Version="15.9.0" />
<PackageReference Include="MSTest.TestAdapter" Version="1.4.0" />
<PackageReference Include="MSTest.TestFramework" Version="1.4.0" />
</ItemGroup>

Comment on lines -8 to -14
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there was no reason for separate dependencies anymore

<ItemGroup Condition="'$(TargetFramework)' == 'netcoreapp3.1'">
<ItemGroup>
<PackageReference Include="EnterpriseLibrary.SemanticLogging.EventSourceAnalyzer.NetCore" Version="2.0.1406.4" />
<PackageReference Include="Microsoft.NET.Test.Sdk" Version="15.9.0" />
<PackageReference Include="MSTest.TestAdapter" Version="1.4.0" />
<PackageReference Include="MSTest.TestFramework" Version="1.4.0" />
<PackageReference Include="Newtonsoft.Json" Version="13.0.1" />
<PackageReference Include="Microsoft.Extensions.Logging" Version="2.2.0" />
<PackageReference Include="Microsoft.Extensions.Logging.Console" Version="2.2.0" />
</ItemGroup>

<ItemGroup>
Expand Down
13 changes: 11 additions & 2 deletions test/DurableTask.Core.Tests/ExceptionHandlingIntegrationTests.cs
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@ namespace DurableTask.Core.Tests
using System.Threading.Tasks;
using DurableTask.Core.Exceptions;
using DurableTask.Emulator;
using Microsoft.Extensions.Logging;
using Microsoft.VisualStudio.TestTools.UnitTesting;
using Newtonsoft.Json;

Expand All @@ -32,9 +33,17 @@ public class ExceptionHandlingIntegrationTests

public ExceptionHandlingIntegrationTests()
{
// configure logging so traces are emitted during tests.
// This facilitates debugging when tests fail.

// TODO: update Microsoft.Extensions.Logging to avoid the following warning suppression
#pragma warning disable CS0618 // Type or member is obsolete
var loggerFactory = new LoggerFactory().AddConsole(LogLevel.Trace);
#pragma warning restore CS0618 // Type or member is obsolete

var service = new LocalOrchestrationService();
this.worker = new TaskHubWorker(service);
this.client = new TaskHubClient(service);
this.worker = new TaskHubWorker(service, loggerFactory);
this.client = new TaskHubClient(service, loggerFactory: loggerFactory);
}

[DataTestMethod]
Expand Down
4 changes: 2 additions & 2 deletions test/DurableTask.Core.Tests/RetryInterceptorTests.cs
Original file line number Diff line number Diff line change
Expand Up @@ -31,7 +31,7 @@ public async Task Invoke_WithFailingRetryCall_ShouldThrowCorrectException()
await Assert.ThrowsExceptionAsync<IOException>(Invoke, "Interceptor should throw the original exception after exceeding max retry attempts.");
}

[TestMethod]
[DataTestMethod]
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was throwing a warning. TestMethods cannot have DataRows, for that you need DataTestMethod instead. The VSTesting task is forgiving enough here, but I changed this to remove the warning

[DataRow(1)]
[DataRow(2)]
[DataRow(3)]
Expand Down Expand Up @@ -59,7 +59,7 @@ public async Task Invoke_WithFailingRetryCall_ShouldHaveCorrectNumberOfCalls(int
Assert.AreEqual(maxAttempts, callCount, 0, $"There should be {maxAttempts} function calls for {maxAttempts} max attempts.");
}

[TestMethod]
[DataTestMethod]
[DataRow(1)]
[DataRow(2)]
[DataRow(3)]
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -7,8 +7,8 @@

<ItemGroup>
<PackageReference Include="Microsoft.NET.Test.Sdk" Version="17.7.1" />
<PackageReference Include="MSTest.TestAdapter" Version="3.1.1" />
<PackageReference Include="MSTest.TestFramework" Version="3.1.1" />
<PackageReference Include="MSTest.TestAdapter" Version="1.4.0" />
<PackageReference Include="MSTest.TestFramework" Version="1.4.0" />
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have an inconsistent use of MSTest, some use version 1.4.0 and others use versoin 3.1.1. The use of different frameworks led to warnings. I've downgraded to the common denominator, as I encountered issues upgrading them all to 3.1.1 (can't remember which unfortunately).

</ItemGroup>

<ItemGroup>
Expand Down
5 changes: 5 additions & 0 deletions test/DurableTask.Test.Orchestrations/SimpleOrchestrations.cs
Original file line number Diff line number Diff line change
Expand Up @@ -409,6 +409,11 @@ public async override Task<string> RunTask(OrchestrationContext context, bool us
responderOrchestration = Task.FromResult("Herkimer is done");
}

// before sending the event, wait a few seconds to ensure the sub-orchestrator exists
// otherwise, we risk a race condition where the event is dicarded because the instances table
// does not yet have the sub-orchestrator instance in it.
await context.CreateTimer<object>(context.CurrentUtcDateTime.AddSeconds(10), state: null);

Comment on lines +412 to +416
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test suddenly became flaky. So added this to make it less flakey

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the right thing to do given the timeline we're working with, but I'm curious to better understand why this particular pattern isn't working reliably.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I know the root cause - if the ExternalEvent arrives before the ExecutionStarted event, then the IsExecutableInstance check will return false. We do that here:

bool IsExecutableInstance(OrchestrationRuntimeState runtimeState, IList<TaskMessage> newMessages, out string message)
{
if (runtimeState.ExecutionStartedEvent == null && !newMessages.Any(msg => msg.Event is ExecutionStartedEvent))
{
var instanceId = newMessages[0].OrchestrationInstance.InstanceId;
if (DurableTask.Core.Common.Entities.AutoStart(instanceId, newMessages))
{
message = null;
return true;
}
else
{
// A non-zero event count usually happens when an instance's history is overwritten by a
// new instance or by a ContinueAsNew. When history is overwritten by new instances, we
// overwrite the old history with new history (with a new execution ID), but this is done
// gradually as we build up the new history over time. If we haven't yet overwritten *all*
// the old history and we receive a message from the old instance (this happens frequently
// with canceled durable timer messages) we'll end up loading just the history that hasn't
// been fully overwritten. We know it's invalid because it's missing the ExecutionStartedEvent.
message = runtimeState.Events.Count == 0 ? "No such instance" : "Invalid history (may have been overwritten by a newer instance)";
return false;
}

So we return "no such instance" as seen above, and we end up discarding the message as it is for an instanceID that never existed. That code is here:

if (!this.IsExecutableInstance(session.RuntimeState, orchestrationWorkItem.NewMessages, out string warningMessage))
{
// If all messages belong to the same execution ID, then all of them need to be discarded.
// However, it's also possible to have messages for *any* execution ID batched together with messages
// to a *specific* (non-executable) execution ID. Those messages should *not* be discarded since
// they might be consumable by another orchestration with the same instance id but different execution ID.
var messagesToDiscard = new List<MessageData>();
var messagesToAbandon = new List<MessageData>();
foreach (MessageData msg in session.CurrentMessageBatch)
{
if (msg.TaskMessage.OrchestrationInstance.ExecutionId == session.Instance.ExecutionId)
{
messagesToDiscard.Add(msg);
}
else
{
messagesToAbandon.Add(msg);
}
}
// If no messages have a matching execution ID, then delete all of them. This means all the
// messages are external (external events, termination, etc.) and were sent to an instance that
// doesn't exist or is no longer in a running state.
if (messagesToDiscard.Count == 0)
{
messagesToDiscard.AddRange(messagesToAbandon);
messagesToAbandon.Clear();
}
// Add all abandoned messages to the deferred list. These messages will not be deleted right now.
// If they can be matched with another orchestration, then great. Otherwise they will be deleted
// the next time they are picked up.
messagesToAbandon.ForEach(session.DeferMessage);
var eventListBuilder = new StringBuilder(orchestrationWorkItem.NewMessages.Count * 40);
foreach (MessageData msg in messagesToDiscard)
{
eventListBuilder.Append(msg.TaskMessage.Event.EventType.ToString()).Append(',');
}
this.settings.Logger.DiscardingWorkItem(
this.azureStorageClient.QueueAccountName,
this.settings.TaskHubName,
session.Instance.InstanceId,
session.Instance.ExecutionId,
orchestrationWorkItem.NewMessages.Count,
session.RuntimeState.Events.Count,
eventListBuilder.ToString(0, eventListBuilder.Length - 1) /* remove trailing comma */,
warningMessage);
// The instance has already completed or never existed. Delete this message batch.
await this.DeleteMessageBatchAsync(session, messagesToDiscard);
await this.ReleaseTaskOrchestrationWorkItemAsync(orchestrationWorkItem);
return null;
}

To be clear, the piece where we end up sending external events to the discard list is this particular section:

// If no messages have a matching execution ID, then delete all of them. This means all the
// messages are external (external events, termination, etc.) and were sent to an instance that
// doesn't exist or is no longer in a running state.
if (messagesToDiscard.Count == 0)
{
messagesToDiscard.AddRange(messagesToAbandon);
messagesToAbandon.Clear();
}

So it's an ordering issue. In theory, I think we could be more conservative and not discard events immediately, but that can creating poison messages. Please let me know if I'm missing something.

// send the id of this orchestration to the responder
var responderInstance = new OrchestrationInstance() { InstanceId = responderId };
context.SendEvent(responderInstance, channelName, context.OrchestrationInstance.InstanceId);
Expand Down
Loading