
Akka.Persistence.Query Throttling implementation - "QueryPermitter" #6436

Merged · 15 commits · Feb 24, 2023

Conversation

@Aaronontheweb (Member) commented Feb 24, 2023

Changes

Implements #6404 using the same design as Akka.Persistence's base RecoveryPermitter. It is used on every iteration of each query.

Related: #6417, #6433

Checklist

For significant changes, please ensure that the following have been completed (delete if not relevant):

Latest dev Benchmarks

Include data from the relevant benchmark prior to this change here.

This PR's Benchmarks

Include data from after this change here.

@Aaronontheweb (Member Author)

SQLite test suite passed locally. Going to try running this inside https://github.com/Aaronontheweb/AkkaSqlQueryCrushTest

@Aaronontheweb (Member Author) commented Feb 24, 2023

Added a "crush test" to apply maximum stress to Akka.Persistence.SqlServer: https://github.com/Aaronontheweb/AkkaSqlQueryCrushTest

1000 recoveries, 1000 * 10 writes, 1000 * 10 persistence queries doing 1 event per second

Baseline

Using Akka.NET v1.4.49 and latest for Akka.Persistence.SqlServer:

[INFO][2/20/2023 11:24:09 PM][Thread 0054][akka://SqlSharding/user/recovery-tracker] Recovery complete - took 76855.5798ms

This PR

Using 1.5.0-beta2 with this PR and max concurrent queries = 30

[INFO][2/20/2023 11:24:09 PM][Thread 0054][akka://SqlSharding/user/recovery-tracker] Recovery complete - took 76855.5798ms

@Aaronontheweb (Member Author)

Redesigned the benchmark to eliminate setup overhead.

Baseline

[INFO][2/24/2023 3:36:07 AM][Thread 0051][akka://SqlSharding/user/recovery-tracker] Recovery complete - took 21990.5583ms

This PR

[INFO][02/24/2023 03:41:21.202Z][Thread 0051][akka://SqlSharding/user/recovery-tracker] Recovery complete - took 23996.9769ms

@Aaronontheweb (Member Author)

Ran the benchmarks with write-side disabled, queries only - might be seeing some of the divergence I expected now...

Baseline

[INFO][2/24/2023 3:43:42 AM][Thread 0051][akka://SqlSharding/user/recovery-tracker] Recovery complete - took 11191.9144ms

This PR

[INFO][2/24/2023 3:41:31 AM][Thread 0051][akka://SqlSharding/user/recovery-tracker] Recovery complete - took 10092.23321ms

Going to try this with 10,000 parallel queries

@Aaronontheweb (Member Author)

10,000 concurrent queries, running continuously

Baseline

[INFO][2/24/2023 3:52:24 AM][Thread 0034][akka://SqlSharding/user/recovery-tracker] Recovery complete - took 66219.7802ms

This PR

[INFO][02/24/2023 03:50:30.356Z][Thread 0057][akka://SqlSharding/user/recovery-tracker] Recovery complete - took 66481.2036ms

Looks to me like whatever issues users were able to reproduce with PostgreSql absolutely melting down at around ~3,000 concurrent queries, my dinky SQL Server instance running inside a Docker container appears to shrug off 10,000 of them. Maybe I need to increase the event counts in the journal to strain the indices some.

@Aaronontheweb (Member Author)

Dialing up the tests to do 100,000 entities / queries with 100 events each.

@Arkatufus (Contributor) commented Feb 24, 2023

In the AllEventsPublisher.cs file, inside the AbstractAllEventPublisher class, inside the Replaying receive method, all Buffer.DeliverBuffer() method calls that don't exit the receive method (lines 150 and 163, I think) need to be changed to:

Buffer.DeliverBuffer(TotalDemand);
if (Buffer.IsEmpty && CurrentOffset > ToOffset)
    OnCompleteThenStop();

EDIT: Not the cause of the test failure

@Aaronontheweb (Member Author)

New test - 100,000 entities with 100 events each, querying up to 10 events at a time.

Baseline

Recovery ran for 20 minutes, with multiple timeout failures occurring. Only 1/100_000 entities successfully recovered.

This PR

(WIP)

@Aaronontheweb (Member Author)

This PR

Recovered 1m events in

[INFO][02/24/2023 06:34:06.647Z][Thread 0045][akka://SqlSharding/user/recovery-tracker] Recovery complete - took 95943.371ms

Some notes:

  • DB CPU never exceeded 20% and dropped to 2% once there were no more results to return (the queries were still running - they just didn't find anything)
  • App memory never exceeded 200 MB

@Aaronontheweb (Member Author)

Completely recovered 10m events with 100,000 queries running continuously

Done in 00:39:58.8746578

@Aaronontheweb (Member Author)

Going to run our control one more time...

@Aaronontheweb (Member Author)

The control experiment started erroring out persistently after about 1 minute - beyond that point, only about 20% of query attempts made it through without erroring out. The throttler definitely preserves the integrity of the system.

@Aaronontheweb Aaronontheweb marked this pull request as ready for review February 24, 2023 20:25
@Aaronontheweb (Member Author) left a comment


Performed a review of my own changes

@@ -40,39 +40,19 @@
<CopyLocalLockFileAssemblies>true</CopyLocalLockFileAssemblies>
</PropertyGroup>
<PropertyGroup>
<PackageReleaseNotes>Version 1.5.0-beta1 contains **breaking API changes** and new API changes for Akka.NET.

Ignore - this is just build.cmd output.


private Continue() { }
}

public static Props Props(long fromOffset, TimeSpan? refreshInterval, int maxBufferSize, string writeJournalPluginId)
public static Props Props(long fromOffset, TimeSpan? refreshInterval, int maxBufferSize, IActorRef writeJournal, IActorRef queryPermitter)

The IActorRef queryPermitter parameter has been added to all queries - this is the actor responsible for managing token buckets for permitting individual queries to run.
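
Roughly, the wiring looks like this - an illustrative sketch of the internal plumbing, not the actual SqlReadJournal code. Only the AllEventsPublisher.Props signature below comes from this diff; the class name, fields, and Source plumbing are stand-ins.

```csharp
using System;
using Akka;
using Akka.Actor;
using Akka.Persistence.Query;
using Akka.Streams.Dsl;

// Sketch: the read journal holds one shared permitter ref and hands it to every
// query publisher it creates, so all queries compete for the same permit pool.
public class SqlReadJournalWiringSketch
{
    private readonly IActorRef _journalRef;      // write journal the publisher replays from
    private readonly IActorRef _queryPermitter;  // one QueryThrottler shared by all queries
    private readonly TimeSpan? _refreshInterval;
    private readonly int _maxBufferSize;

    public SqlReadJournalWiringSketch(IActorRef journalRef, IActorRef queryPermitter,
        TimeSpan? refreshInterval, int maxBufferSize)
    {
        _journalRef = journalRef;
        _queryPermitter = queryPermitter;
        _refreshInterval = refreshInterval;
        _maxBufferSize = maxBufferSize;
    }

    public Source<EventEnvelope, NotUsed> AllEvents(Offset offset)
    {
        var fromOffset = (offset as Sequence)?.Value ?? 0L;
        return Source.ActorPublisher<EventEnvelope>(
                AllEventsPublisher.Props(fromOffset, _refreshInterval, _maxBufferSize,
                    _journalRef, _queryPermitter)) // the new queryPermitter parameter
            .MapMaterializedValue(_ => NotUsed.Instance)
            .Named("AllEvents");
    }
}
```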

Become(WaitingForQueryPermit);
}

protected bool WaitingForQueryPermit(object message)

All queries now run with an execution flow that looks similar to this:

stateDiagram-v2
    [*] --> Init: Request (from Akka.Streams)
    Init --> RequestQueryPermit: RequestQueryStart
    RequestQueryPermit --> Replay: QueryStartGranted
    Replay --> Replay: ReplayedMessage
    Replay --> Idle: RecoverySuccess (return token to permitter)
    Idle --> RequestQueryPermit: Continue

At each stage in the query, we need permission from the QueryPermitter to continue - this is designed to help rate-limit the number of queries that can hit the database at any given time.
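
A minimal, self-contained sketch of that handshake from the query actor's side. The message names (RequestQueryStart, QueryStartGranted, ReturnQueryStart, Continue) match this PR and the diagram above; the actor itself is an illustrative stand-in that collapses the Replay state, not the actual publisher code.

```csharp
using System;
using Akka.Actor;

// Illustrative stand-ins for the PR's internal messages
// (the real ones live inside Akka.Persistence.Query.Sql).
public sealed class RequestQueryStart
{
    public static readonly RequestQueryStart Instance = new RequestQueryStart();
    private RequestQueryStart() { }
}

public sealed class QueryStartGranted
{
    public static readonly QueryStartGranted Instance = new QueryStartGranted();
    private QueryStartGranted() { }
}

public sealed class ReturnQueryStart
{
    public static readonly ReturnQueryStart Instance = new ReturnQueryStart();
    private ReturnQueryStart() { }
}

public sealed class Continue
{
    public static readonly Continue Instance = new Continue();
    private Continue() { }
}

// Toy query actor: on every polling tick it must acquire a permit before touching the journal.
// Usage: system.ActorOf(Props.Create(() => new ThrottledQuerySketch(permitter)))
public class ThrottledQuerySketch : ActorBase
{
    private readonly IActorRef _queryPermitter;

    public ThrottledQuerySketch(IActorRef queryPermitter)
    {
        _queryPermitter = queryPermitter;
    }

    protected override bool Receive(object message) => Idle(message);

    private bool Idle(object message)
    {
        if (!(message is Continue)) return false;
        _queryPermitter.Tell(RequestQueryStart.Instance); // ask the permitter for a token
        Become(WaitingForQueryPermit);
        return true;
    }

    private bool WaitingForQueryPermit(object message)
    {
        if (!(message is QueryStartGranted)) return false;
        // Token acquired: this is where the real publisher replays events from the journal.
        // Once the journal signals completion (or failure), the token goes back:
        _queryPermitter.Tell(ReturnQueryStart.Instance);
        // Schedule the next polling cycle, then sit idle until the next Continue tick.
        Context.System.Scheduler.ScheduleTellOnce(
            TimeSpan.FromSeconds(3), Self, Continue.Instance, ActorRefs.NoSender);
        Become(Idle);
        return true;
    }
}
```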

@@ -129,6 +157,7 @@ protected bool Replaying( object message )
Log.Error(failure.Cause, "event replay failed, due to [{0}]", failure.Cause.Message);
Buffer.DeliverBuffer(TotalDemand);
OnErrorThenStop(failure.Cause);
QueryPermitter.Tell(ReturnQueryStart.Instance); // return token to permitter

Return tokens when we fail and when we succeed - always return them after the database has completed the query, successfully or otherwise.

/// <summary>
/// Request token from throttler
/// </summary>
internal sealed class RequestQueryStart

Message for requesting a token

/// <summary>
/// Return token to throttler
/// </summary>
internal sealed class ReturnQueryStart

Token returned to pool.

/// <remarks>
/// Works identically to the RecoveryPermitter built into Akka.Persistence.
/// </remarks>
internal sealed class QueryThrottler : ReceiveActor

This actor is almost an exact duplicate of the RecoveryPermitter built into the base Akka.Persistence library - that design has been working successfully for years to solve this same problem during PersistentActor recovery.
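
For reviewers who don't want to diff against RecoveryPermitter, here is a rough approximation of the design: a fixed permit pool, a FIFO queue of waiting queries, and death-watch on permit holders. It reuses the message stand-ins from the sketch earlier in this review; member names and details below are illustrative, not copied from this PR.

```csharp
using System.Collections.Generic;
using Akka.Actor;

// Rough approximation of the RecoveryPermitter-style throttler described above.
// Relies on the RequestQueryStart / QueryStartGranted / ReturnQueryStart stand-ins
// defined in the earlier sketch.
public class QueryThrottlerSketch : ReceiveActor
{
    private readonly int _maxPermits;
    private int _usedPermits;
    private readonly LinkedList<IActorRef> _pending = new LinkedList<IActorRef>();

    public QueryThrottlerSketch(int maxPermits)
    {
        _maxPermits = maxPermits;

        Receive<RequestQueryStart>(_ =>
        {
            Context.Watch(Sender);            // reclaim the permit if the requestor dies mid-query
            if (_usedPermits >= _maxPermits)
                _pending.AddLast(Sender);     // pool exhausted - queue the requestor
            else
                GrantPermit(Sender);
        });

        Receive<ReturnQueryStart>(_ => ReturnPermit(Sender, unwatch: true));

        Receive<Terminated>(t =>
        {
            // A watched requestor died: either drop it from the wait queue,
            // or treat its death as an implicit permit return.
            if (!_pending.Remove(t.ActorRef))
                ReturnPermit(t.ActorRef, unwatch: false);
        });
    }

    private void GrantPermit(IActorRef actorRef)
    {
        _usedPermits++;
        actorRef.Tell(QueryStartGranted.Instance);
    }

    private void ReturnPermit(IActorRef actorRef, bool unwatch)
    {
        _usedPermits--;
        if (unwatch)
            Context.Unwatch(actorRef);        // always unwatch a requestor that returns its permit

        if (_pending.Count > 0)
        {
            var next = _pending.First.Value;  // hand the freed permit to the next waiter (FIFO)
            _pending.RemoveFirst();
            GrantPermit(next);
        }
    }
}
```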


Receive<RequestQueryStart>(_ =>
{
Context.Watch(Sender);

Death-watch requestors of tokens so we can reclaim used tokens if a requestor dies.

private void ReturnQueryPermit(IActorRef actorRef)
{
_usedPermits--;
Context.Unwatch(actorRef);

Always unwatch someone who returns a permit.

_maxBufferSize = config.GetInt("max-buffer-size", 0);
_maxConcurrentQueries = config.GetInt("max-concurrent-queries", 50);

Make this value configurable in HOCON - 50 by default.
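
To override the default, something like the following should work - note that the HOCON section shown is my assumption (the SQL read-journal's config block); check the plugin's reference.conf for the authoritative path.

```csharp
using Akka.Actor;
using Akka.Configuration;

// Assumed config section - verify against the plugin's reference.conf.
var config = ConfigurationFactory.ParseString(@"
    akka.persistence.query.journal.sql {
        # cap on the number of queries allowed to hit the database at once (default: 50)
        max-concurrent-queries = 30
    }");

// The custom config is layered on top of the built-in defaults when the system starts.
var system = ActorSystem.Create("QueryTuningDemo", config);
```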

@Aaronontheweb (Member Author)

Going to add some documentation for this feature in the "what's new" page - but we don't have a dedicated page for fine-tuning Akka.Persistence.Query today.

@Arkatufus (Contributor)

Question regarding the token system: if queries are failing mid-way, will the available tokens eventually run out because the failing queries never return them?

@Aaronontheweb (Member Author)

Question regarding the token system: if queries are failing mid-way, will the available tokens eventually run out because the failing queries never return them?

If the stream stage fails, the tokens are automatically returned when the actor kills itself. If the journal returns an error, we also return the tokens at that point, @Arkatufus

@Arkatufus (Contributor) left a comment

LGTM
