Akka.Net actors freezing #4376

Nk185 · 2020-04-10T17:23:24Z

Environment

OS: Windows 10.0.17134.885 (1803/April2018Update/Redstone4)
CPU: Intel Core i5-8600K CPU 3.60GHz (Coffee Lake), 1 CPU, 6 logical and 6 physical cores
.Net Core SDK: 3.1.102; .NET Core 3.1.2 (CoreCLR 4.700.20.6602, CoreFX 4.700.20.6702), X64 RyuJIT
Akka.Net: 1.4.4 or older
Build configuration: Release x64

Symptoms

After several restarts, actors behind a pool router stop working.

How to reproduce

Either use the code from the gist or follow these steps:

1. Create a simple actor that can either do something (e.g., write to stdout) or throw an exception.
2. Instantiate an ActorSystem and spawn the actor from step 1 behind a pool router with, say, 10 routees.
3. In a loop (the lowest number of iterations I tried was 5), force the actor from step 1 to throw an exception.
4. Optionally wait a bit after the loop ends.
5. In a loop with a number of iterations equal to the number of routees, try to force the actor from step 1 to do something.
6. Observe that not all of the actors process the message.

Side note: you may need to experiment with the number of thrown exceptions.

Gist overview

In the Gist I try different ways of making the SimpleActor throw an exception and then print some text.
There are six main scenarios:

1. an actor behind a round-robin pool router with no routee supervision strategy specified (i.e., the default one, which escalates the failure),
2. the same, but with routee supervision strategy "Restart",
3. the same, but with routee supervision strategy "Resume",
4. the same, but with routee supervision strategy "Escalate" (effectively the same as the first scenario),
5. a custom round-robin router based on an actor with supervision strategy "Escalate",
6. a single actor without a router.

For each of the scenarios with a router (all except the last one), I force the SimpleActor to throw an exception. Then I get all routees and "ask" them to print some text with their address to stdout, both directly via Tell and indirectly through the router.

Main observations based on Gist code

1. In all cases where the routee supervision strategy is not "Escalate", the routees print text to stdout through both direct and indirect Tells.
2. In cases where the routee supervision strategy is "Escalate", one or more routees did not print any text to stdout (or even read the message from their mailbox), through either direct or indirect Tells.
3. All actors that got stuck had MailboxStatus equal to 4 (SuspendUnit) and SuspendMask set.
4. The issue reproduces with a custom ReceiveActor-based router, which means that the issue is not in (or is not only in) the implementation of RoutedActorCell and RouterActor.
5. To "catch" the issue you need to experiment with the number of thrown exceptions: if not enough exceptions are thrown, or there is a delay between them, the issue will not reproduce.

Assumptions

The number and frequency of exceptions matter, the number of frozen routees differs from run to run, and I wasn't able to reproduce the issue with a non-Escalate supervision strategy. I therefore presume that there is a race condition during escalation, in the Dispatcher or in the Mailbox, that prevents the SuspendUnit from being removed from the actor's mailbox.

Workaround

Option 1

The easiest way to work around this is to specify the needed supervision strategy at the router level, like Props.Create<YourAwesomeActor>().WithRouter(new RoundRobinPool(routeesNr, null, new OneForOneStrategy(Decider.From(Directive.Restart)), Dispatchers.DefaultDispatcherId)); as this leads to the same default behaviour: it restarts your routee (unless you overrode it in configs).

Option 2

If you need more complicated logic that is based on the state of the router's parent, you have to write your own router actor, but be aware that this will have a performance impact.


Aaronontheweb · 2020-04-22T20:40:39Z

Thanks for the really detailed write-up - we'll look into it

Arkatufus · 2020-04-27T14:32:35Z

The bug happens when multiple children of a pool node fail with Directive.Escalate in rapid succession.

On a failure, the child will suspend itself before bubbling a Failed message to the pool node. The parent pool node, receiving the Failed message, will in turn set its IsFailed flag and bubble the same message to its parent. The problem starts when another child fails before the pool node recovers. Since the child always suspends itself before bubbling up a Failure event, it will never get resumed again: its Failure event will never go past the failed pool node, because the pool node refuses to process any subsequent Failure event while its IsFailed flag is set. This causes an unbalanced suspend on the child that can never be resolved, except by restarting the pool node.

The child was suspended here: https://github.com/akkadotnet/akka.net/blob/dev/src/core/Akka/Actor/ActorCell.FaultHandling.cs#L231
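
The sequence described in this comment can be sketched as a small toy model. This is not Akka.NET code: the class names (Child, PoolNode), the suspend_count field and the recover() step are hypothetical, and the model assumes that recovery resumes only the children whose failure was actually escalated. It only illustrates why a second failure that arrives while the pool node is still failed leaves a child with an unbalanced suspend.

```python
from dataclasses import dataclass, field


@dataclass
class Child:
    name: str
    suspend_count: int = 0  # > 0 means the mailbox is suspended

    def fail(self, parent: "PoolNode") -> None:
        # The child suspends itself *before* reporting the failure upward.
        self.suspend_count += 1
        parent.on_child_failed(self)


@dataclass
class PoolNode:
    is_failed: bool = False
    escalated: list = field(default_factory=list)  # failures passed upward
    dropped: list = field(default_factory=list)    # failures that were ignored

    def on_child_failed(self, child: Child) -> None:
        if self.is_failed:
            # Already failed: the new failure is not processed,
            # so nothing will ever resume this child.
            self.dropped.append(child.name)
            return
        self.is_failed = True
        self.escalated.append(child)

    def recover(self) -> None:
        # Toy assumption: recovery resumes only the children whose
        # failure was actually escalated.
        for child in self.escalated:
            child.suspend_count -= 1
        self.escalated.clear()
        self.is_failed = False


def run(rapid: bool) -> dict:
    pool = PoolNode()
    a, b = Child("a"), Child("b")
    a.fail(pool)
    if not rapid:
        pool.recover()  # the pool node recovers before the next failure
    b.fail(pool)
    pool.recover()
    return {c.name: c.suspend_count for c in (a, b)}


print(run(rapid=True))   # child "b" keeps a suspend count of 1
print(run(rapid=False))  # both children end with a suspend count of 0
```

In the rapid case the second failure is dropped and child "b" stays suspended, which matches the observation that the issue only appears when exceptions are thrown without a delay between them.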

Aaronontheweb · 2020-04-27T16:51:32Z

@Arkatufus so it sounds like the issue is that the state for the supervisor, when using OneForOne strategies, doesn't account for multiple children failing once the parent has already escalated once. Can you reproduce this problem for actors where the parent isn't a pool router? Or is it a router-specific problem?

Arkatufus · 2020-04-30T19:07:38Z

This is, somehow, a pool router only problem

Nk185 · 2020-04-30T21:25:25Z

@Arkatufus may I ask you to check things again? I was able to reproduce the issue for a non-router actor (the CustomRouter<T> in my Gist) which is a regular ReceiveActor with many failing children. Btw, this was also outlined in the 4th point of the "Main observations based on Gist code" paragraph.

Aaronontheweb · 2020-05-05T14:49:34Z

@Nk185 have you followed @Arkatufus's work on this issue in #4393? Might want to review the code and changes there.

Arkatufus · 2020-05-06T20:14:56Z

@Nk185 I might need to make a more comprehensive test on it, I've only managed to fail the pool router so far.

Arkatufus · 2020-05-06T20:17:45Z

As a progress update, I've found that there is a code discrepancy between Akka.NET and Scala Akka on this line:
https://github.com/akkadotnet/akka.net/blob/dev/src/core/Akka/Actor/ActorCell.DefaultMessages.cs#L260

The check should live inside the switch statement, not outside of it. Fixing that fixes the spurious exception handling by the Actor supervisor, but something else is still blocking the Mailbox from getting resumed.
I'm still looking into the cause.

Arkatufus · 2020-05-07T14:53:04Z

This seems to fix the bug, need further testing

Aaronontheweb · 2020-05-07T14:59:10Z

@Arkatufus nice work - I'll review it

* Add bug reproduction spec
* Debug test program
* Fix #4376, Actor suspended indefinetly after failing
* Fix broken Visual Studio solution file
* Unroll recursion in SysMsgInvokeAll

Co-authored-by: Aaron Stannard

Aaronontheweb added akka-actor potential bug labels Apr 22, 2020

Aaronontheweb assigned Arkatufus Apr 22, 2020

Arkatufus added confirmed bug and removed potential bug labels Apr 27, 2020

Arkatufus mentioned this issue Apr 27, 2020

Bug fix #4376 #4393

Merged

Aaronontheweb modified the milestones: 1.4.5, 1.4.6 Apr 29, 2020

Arkatufus added a commit to Arkatufus/akka.net that referenced this issue May 7, 2020

Fix akkadotnet#4376, Actor suspended indefinetly after failing

7531c48

Aaronontheweb closed this as completed in #4393 May 8, 2020
