
Akka.Persistence.Query Throttling implementation - "QueryPermitter" #6436

Merged · 15 commits · Feb 24, 2023

Conversation

@Aaronontheweb (Member) commented Feb 24, 2023

Changes

Implements #6404 using the same design as Akka.Persistence's base RecoveryPermitter. It is used on every iteration of each query.

Related: #6417, #6433

Checklist

For significant changes, please ensure that the following have been completed (delete if not relevant):

Latest dev Benchmarks

Include data from the relevant benchmark prior to this change here.

This PR's Benchmarks

Include data from after this change here.

@Aaronontheweb (Member Author)

SQLite test suite passed locally. Going to try running this inside https://github.com/Aaronontheweb/AkkaSqlQueryCrushTest

@Aaronontheweb (Member Author) commented Feb 24, 2023

Added a "crush test" to apply maximum stress to Akka.Persistence.SqlServer: https://github.com/Aaronontheweb/AkkaSqlQueryCrushTest

1000 recoveries, 1000 * 10 writes, 1000 * 10 persistence queries doing 1 event per second

Baseline

Using Akka.NET v1.4.49 and latest for Akka.Persistence.SqlServer:

[INFO][2/20/2023 11:24:09 PM][Thread 0054][akka://SqlSharding/user/recovery-tracker] Recovery complete - took 76855.5798ms

This PR

Using 1.5.0-beta2 with this PR and max concurrent queries = 30

[INFO][2/20/2023 11:24:09 PM][Thread 0054][akka://SqlSharding/user/recovery-tracker] Recovery complete - took 76855.5798ms

@Aaronontheweb (Member Author)

Redesigned the benchmark to eliminate setup overhead.

Baseline

[INFO][2/24/2023 3:36:07 AM][Thread 0051][akka://SqlSharding/user/recovery-tracker] Recovery complete - took 21990.5583ms

This PR

[INFO][02/24/2023 03:41:21.202Z][Thread 0051][akka://SqlSharding/user/recovery-tracker] Recovery complete - took 23996.9769ms

@Aaronontheweb (Member Author)

Ran the benchmarks with write-side disabled, queries only - might be seeing some of the divergence I expected now...

Baseline

[INFO][2/24/2023 3:43:42 AM][Thread 0051][akka://SqlSharding/user/recovery-tracker] Recovery complete - took 11191.9144ms

This PR

[INFO][2/24/2023 3:41:31 AM][Thread 0051][akka://SqlSharding/user/recovery-tracker] Recovery complete - took 10092.23321ms

Going to try this with 10,000 parallel queries

@Aaronontheweb (Member Author)

10,000 concurrent queries, running continuously

Baseline

[INFO][2/24/2023 3:52:24 AM][Thread 0034][akka://SqlSharding/user/recovery-tracker] Recovery complete - took 66219.7802ms

This PR

[INFO][02/24/2023 03:50:30.356Z][Thread 0057][akka://SqlSharding/user/recovery-tracker] Recovery complete - took 66481.2036ms

Looks to me like whatever issues users were able to reproduce with PostgreSql absolutely melting down at around ~3,000 concurrent queries, my dinky SQL Server instance running inside a Docker container appears to shrug off 10,000 of them. Maybe I need to increase the event counts in the journal to strain the indices some.

@Aaronontheweb (Member Author)

Dialing up the tests to do 100,000 entities / queries with 100 events each.

@Arkatufus (Contributor) commented Feb 24, 2023

In the AllEventsPublisher.cs file, inside the AbstractAllEventPublisher class, inside the Replaying receive method, all Buffer.DeliverBuffer() method calls that don't exit the receive method (lines 150 and 163, I think) need to be changed to:

Buffer.DeliverBuffer(TotalDemand);
if (Buffer.IsEmpty && CurrentOffset > ToOffset)
    OnCompleteThenStop();

EDIT: Not the cause of the test failure

@Aaronontheweb (Member Author)

New test - 100,000 entities with 100 events each, querying up to 10 events at a time.

Baseline

Recovery ran for 20 minutes, with multiple timeout failures occurring. Only 1/100_000 entities successfully recovered.

This PR

(WIP)

@Aaronontheweb (Member Author)

This PR

Recovered 1m events in

[INFO][02/24/2023 06:34:06.647Z][Thread 0045][akka://SqlSharding/user/recovery-tracker] Recovery complete - took 95943.371ms

Some notes:

  • DB CPU never exceeded 20% and dropped to 2% once there were no more results to return (the queries were still running - they just didn't find anything)
  • App memory never exceeded 200 MB

@Aaronontheweb (Member Author)

Completely recovered 10m events with 100,000 queries running continuously

Done in 00:39:58.8746578

@Aaronontheweb (Member Author)

Going to run our control one more time...

@Aaronontheweb (Member Author)

The control experiment started erroring out persistently after about 1 minute - beyond that point, only about 20% of query attempts made it through without erroring out. The throttler definitely preserves the integrity of the system.

@Aaronontheweb Aaronontheweb marked this pull request as ready for review February 24, 2023 20:25
@Aaronontheweb (Member Author) left a comment


Performed a review of my own changes

@@ -40,39 +40,19 @@
<CopyLocalLockFileAssemblies>true</CopyLocalLockFileAssemblies>
</PropertyGroup>
<PropertyGroup>
<PackageReleaseNotes>Version 1.5.0-beta1 contains **breaking API changes** and new API changes for Akka.NET.

Ignore - this is just build.cmd output.


private Continue() { }
}

public static Props Props(long fromOffset, TimeSpan? refreshInterval, int maxBufferSize, string writeJournalPluginId)
public static Props Props(long fromOffset, TimeSpan? refreshInterval, int maxBufferSize, IActorRef writeJournal, IActorRef queryPermitter)

The IActorRef queryPermitter parameter has been added to all queries - this is the actor responsible for managing token buckets for permitting individual queries to run.
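
Roughly, the wiring looks like this - an illustrative sketch of the internal plumbing, not the actual SqlReadJournal code. Only the AllEventsPublisher.Props signature below comes from this diff; the class name, fields, and Source plumbing are stand-ins.

```csharp
using System;
using Akka;
using Akka.Actor;
using Akka.Persistence.Query;
using Akka.Streams.Dsl;

// Sketch: the read journal holds one shared permitter ref and hands it to every
// query publisher it creates, so all queries compete for the same permit pool.
public class SqlReadJournalWiringSketch
{
    private readonly IActorRef _journalRef;      // write journal the publisher replays from
    private readonly IActorRef _queryPermitter;  // one QueryThrottler shared by all queries
    private readonly TimeSpan? _refreshInterval;
    private readonly int _maxBufferSize;

    public SqlReadJournalWiringSketch(IActorRef journalRef, IActorRef queryPermitter,
        TimeSpan? refreshInterval, int maxBufferSize)
    {
        _journalRef = journalRef;
        _queryPermitter = queryPermitter;
        _refreshInterval = refreshInterval;
        _maxBufferSize = maxBufferSize;
    }

    public Source<EventEnvelope, NotUsed> AllEvents(Offset offset)
    {
        var fromOffset = (offset as Sequence)?.Value ?? 0L;
        return Source.ActorPublisher<EventEnvelope>(
                AllEventsPublisher.Props(fromOffset, _refreshInterval, _maxBufferSize,
                    _journalRef, _queryPermitter)) // the new queryPermitter parameter
            .MapMaterializedValue(_ => NotUsed.Instance)
            .Named("AllEvents");
    }
}
```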

Become(WaitingForQueryPermit);
}

protected bool WaitingForQueryPermit(object message)

All queries now run with an execution flow that looks similar to this:

stateDiagram-v2
    [*] --> Init: Request (from Akka.Streams)
    Init --> RequestQueryPermit: RequestQueryStart
    RequestQueryPermit --> Replay: QueryStartGranted
    Replay --> Replay: ReplayedMessage
    Replay --> Idle: RecoverySuccess (return token to permitter)
    Idle --> RequestQueryPermit: Continue

At each stage in the query, we need permission from the QueryPermitter to continue - this is designed to help rate-limit the number of queries that can hit the database at any given time.
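
A minimal, self-contained sketch of that handshake from the query actor's side. The message names (RequestQueryStart, QueryStartGranted, ReturnQueryStart, Continue) match this PR and the diagram above; the actor itself is an illustrative stand-in that collapses the Replay state, not the actual publisher code.

```csharp
using System;
using Akka.Actor;

// Illustrative stand-ins for the PR's internal messages
// (the real ones live inside Akka.Persistence.Query.Sql).
public sealed class RequestQueryStart
{
    public static readonly RequestQueryStart Instance = new RequestQueryStart();
    private RequestQueryStart() { }
}

public sealed class QueryStartGranted
{
    public static readonly QueryStartGranted Instance = new QueryStartGranted();
    private QueryStartGranted() { }
}

public sealed class ReturnQueryStart
{
    public static readonly ReturnQueryStart Instance = new ReturnQueryStart();
    private ReturnQueryStart() { }
}

public sealed class Continue
{
    public static readonly Continue Instance = new Continue();
    private Continue() { }
}

// Toy query actor: on every polling tick it must acquire a permit before touching the journal.
// Usage: system.ActorOf(Props.Create(() => new ThrottledQuerySketch(permitter)))
public class ThrottledQuerySketch : ActorBase
{
    private readonly IActorRef _queryPermitter;

    public ThrottledQuerySketch(IActorRef queryPermitter)
    {
        _queryPermitter = queryPermitter;
    }

    protected override bool Receive(object message) => Idle(message);

    private bool Idle(object message)
    {
        if (!(message is Continue)) return false;
        _queryPermitter.Tell(RequestQueryStart.Instance); // ask the permitter for a token
        Become(WaitingForQueryPermit);
        return true;
    }

    private bool WaitingForQueryPermit(object message)
    {
        if (!(message is QueryStartGranted)) return false;
        // Token acquired: this is where the real publisher replays events from the journal.
        // Once the journal signals completion (or failure), the token goes back:
        _queryPermitter.Tell(ReturnQueryStart.Instance);
        // Schedule the next polling cycle, then sit idle until the next Continue tick.
        Context.System.Scheduler.ScheduleTellOnce(
            TimeSpan.FromSeconds(3), Self, Continue.Instance, ActorRefs.NoSender);
        Become(Idle);
        return true;
    }
}
```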

@@ -129,6 +157,7 @@ protected bool Replaying( object message )
Log.Error(failure.Cause, "event replay failed, due to [{0}]", failure.Cause.Message);
Buffer.DeliverBuffer(TotalDemand);
OnErrorThenStop(failure.Cause);
QueryPermitter.Tell(ReturnQueryStart.Instance); // return token to permitter

Return tokens when we fail and when we succeed - always return them after the database has completed the query, successfully or otherwise.

/// <summary>
/// Request token from throttler
/// </summary>
internal sealed class RequestQueryStart

Message for requesting a token

/// <summary>
/// Return token to throttler
/// </summary>
internal sealed class ReturnQueryStart

Token returned to pool.

/// <remarks>
/// Works identically to the RecoveryPermitter built into Akka.Persistence.
/// </remarks>
internal sealed class QueryThrottler : ReceiveActor

This actor is almost an exact duplicate of the RecoveryPermitter built into the base Akka.Persistence library - that design has been working successfully for years to solve this same problem during PersistentActor recovery.
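
For reviewers who don't want to diff against RecoveryPermitter, here is a rough approximation of the design: a fixed permit pool, a FIFO queue of waiting queries, and death-watch on permit holders. It reuses the message stand-ins from the sketch earlier in this review; member names and details below are illustrative, not copied from this PR.

```csharp
using System.Collections.Generic;
using Akka.Actor;

// Rough approximation of the RecoveryPermitter-style throttler described above.
// Relies on the RequestQueryStart / QueryStartGranted / ReturnQueryStart stand-ins
// defined in the earlier sketch.
public class QueryThrottlerSketch : ReceiveActor
{
    private readonly int _maxPermits;
    private int _usedPermits;
    private readonly LinkedList<IActorRef> _pending = new LinkedList<IActorRef>();

    public QueryThrottlerSketch(int maxPermits)
    {
        _maxPermits = maxPermits;

        Receive<RequestQueryStart>(_ =>
        {
            Context.Watch(Sender);            // reclaim the permit if the requestor dies mid-query
            if (_usedPermits >= _maxPermits)
                _pending.AddLast(Sender);     // pool exhausted - queue the requestor
            else
                GrantPermit(Sender);
        });

        Receive<ReturnQueryStart>(_ => ReturnPermit(Sender, unwatch: true));

        Receive<Terminated>(t =>
        {
            // A watched requestor died: either drop it from the wait queue,
            // or treat its death as an implicit permit return.
            if (!_pending.Remove(t.ActorRef))
                ReturnPermit(t.ActorRef, unwatch: false);
        });
    }

    private void GrantPermit(IActorRef actorRef)
    {
        _usedPermits++;
        actorRef.Tell(QueryStartGranted.Instance);
    }

    private void ReturnPermit(IActorRef actorRef, bool unwatch)
    {
        _usedPermits--;
        if (unwatch)
            Context.Unwatch(actorRef);        // always unwatch a requestor that returns its permit

        if (_pending.Count > 0)
        {
            var next = _pending.First.Value;  // hand the freed permit to the next waiter (FIFO)
            _pending.RemoveFirst();
            GrantPermit(next);
        }
    }
}
```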


Receive<RequestQueryStart>(_ =>
{
Context.Watch(Sender);

Death-watch requestors of tokens so we can reclaim used tokens if a requestor dies.

private void ReturnQueryPermit(IActorRef actorRef)
{
_usedPermits--;
Context.Unwatch(actorRef);

Always unwatch someone who returns a permit.

_maxBufferSize = config.GetInt("max-buffer-size", 0);
_maxConcurrentQueries = config.GetInt("max-concurrent-queries", 50);

Make this value configurable in HOCON - 50 by default.
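
To override the default, something like the following should work - note that the HOCON section shown is my assumption (the SQL read-journal's config block); check the plugin's reference.conf for the authoritative path.

```csharp
using Akka.Actor;
using Akka.Configuration;

// Assumed config section - verify against the plugin's reference.conf.
var config = ConfigurationFactory.ParseString(@"
    akka.persistence.query.journal.sql {
        # cap on the number of queries allowed to hit the database at once (default: 50)
        max-concurrent-queries = 30
    }");

// The custom config is layered on top of the built-in defaults when the system starts.
var system = ActorSystem.Create("QueryTuningDemo", config);
```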

@Aaronontheweb (Member Author)

Going to add some documentation for this feature in the "what's new" page - but we don't have a dedicated page for fine-tuning Akka.Persistence.Query today.

@Arkatufus (Contributor)

Question regarding the token system: if queries are failing mid-way, will the available tokens eventually run out because the failing queries never return them?

@Aaronontheweb (Member Author)

Question regarding the token system: if queries are failing mid-way, will the available tokens eventually run out because the failing queries never return them?

If the stream stage fails, the tokens are automatically returned when the actor kills itself. If the journal returns an error, we also return the tokens at that point, @Arkatufus

@Arkatufus (Contributor) left a comment

LGTM
