Akka.Net actors freezing #4376

Closed
Nk185 opened this issue Apr 10, 2020 · 10 comments · Fixed by #4393

@Nk185

Nk185 commented Apr 10, 2020

Environment

  • OS: Windows 10.0.17134.885 (1803/April2018Update/Redstone4)
  • CPU: Intel Core i5-8600K CPU 3.60GHz (Coffee Lake), 1 CPU, 6 logical and 6 physical cores
  • .Net Core SDK: 3.1.102; .NET Core 3.1.2 (CoreCLR 4.700.20.6602, CoreFX 4.700.20.6702), X64 RyuJIT
  • Akka.Net: 1.4.4 or older
  • Build configuration: Release x64

Symptoms

After several restarts, actors behind a pool router stop working.

How to reproduce

Either use the code from the gist or follow these steps:

  1. Create a simple actor that can either do something (e.g., write to stdout) or throw an exception.
  2. Instantiate ActorSystem and spawn the actor from step 1 behind a pool router with, say, 10 routees.
  3. In a loop (the lowest number of iterations I tried was 5) force the actor from step 1 to throw an exception.
  4. Optionally wait a bit after the loop ends.
  5. In a loop with a number of iterations equal to the number of routees, try to force the actor from step 1 to do something.
  6. Observe that not all of the actors process the message.

Side note: you may need to experiment with the number of thrown exceptions.

Gist overview

In the Gist I try several ways of making the SimpleActor throw an exception and then print some text.
Six main scenarios were tried:

  • an actor behind a round-robin pool router without a routee supervision strategy specified (i.e., the default one, which escalates the failure),
  • same, but with routee supervision strategy "Restart",
  • same, but with routee supervision strategy "Resume",
  • same, but with routee supervision strategy "Escalate" (effectively the same as the first scenario),
  • a custom round-robin router based on an actor with supervision strategy "Escalate",
  • a single actor without a router.

For each of the scenarios with a router (so all except the last one), I force the SimpleActor to throw an exception. Then I get all routees and "ask" them to print a text with their address to stdout, both with a direct Tell and indirectly via the router.

Main observations based on Gist code

  1. In all cases where the routees' supervision strategy is not "Escalate", every routee prints its text to stdout through both direct and indirect Tells.
  2. In cases where the routees' supervision strategy is "Escalate", one or more routees did not print any text to stdout (or even read the message from their mailbox), through either direct or indirect Tells.
  3. All actors that got stuck had MailboxStatus equal to 4 (SuspendUnit) and SuspendMask set.
  4. The issue reproduces with a custom ReceiveActor-based router, which means that the issue is not in (or is not only in) the implementation of RoutedActorCell and RouterActor.
  5. To "catch" the issue you need to experiment with the number of thrown exceptions: if too few exceptions are thrown, or there is a delay between them, the issue will not reproduce.
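
To make observation 3 easier to read, here is a minimal sketch of how a status value such as 4 can be decoded. The constants are an assumption: they mirror the mailbox status layout I believe Akka uses (two low state bits, with the suspend count stored in the bits above them), so check them against the Mailbox source before relying on them. The helper names are hypothetical and are not part of Akka.Net.

```python
# Assumed layout, modelled on Akka's mailbox status word (not verified here):
# bit 0 = closed, bit 1 = scheduled, bits 2+ = number of outstanding suspends.
CLOSED = 1
SCHEDULED = 2
SUSPEND_UNIT = 4   # one suspend adds 4 to the status
SUSPEND_MASK = ~3  # selects every bit above the two state bits


def is_suspended(status: int) -> bool:
    """True when at least one suspend has not been matched by a resume."""
    return (status & SUSPEND_MASK) != 0


def suspend_count(status: int) -> int:
    """Number of outstanding suspends encoded in the status word."""
    return (status & SUSPEND_MASK) // SUSPEND_UNIT


def describe(status: int) -> dict:
    """Hypothetical helper: break a status word into readable fields."""
    return {
        "suspended": is_suspended(status),
        "suspend_count": suspend_count(status),
        "scheduled": bool(status & SCHEDULED),
        "closed": status == CLOSED,
    }


# The value reported for the stuck routees.
print(describe(4))
```

Under these assumptions, a status of 4 reads as "exactly one outstanding suspend, not scheduled, not closed", which is consistent with a mailbox that was suspended once and never resumed.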

Assumptions

The number of exceptions and their frequency both have an effect, the number of frozen routees differs on each run, and I wasn't able to reproduce the issue with a non-Escalate supervision strategy. From this I presume that there is a race condition during escalation, in the Dispatcher or in the Mailbox, that prevents the SuspendUnit from being removed from the actor's mailbox.

Workaround

Option 1

The easiest way to work around this is to specify the required supervision strategy at the router level, e.g. Props.Create<YourAwesomeActor>().WithRouter(new RoundRobinPool(routeesNr, null, new OneForOneStrategy(Decider.From(Directive.Restart)), Dispatchers.DefaultDispatcherId)); This leads to the same default behaviour: it restarts your routee (unless you overrode it in configs).

Option 2

If you need more complicated logic that is based on the state of the router's parent, you have to write your own router actor, but be aware that this will have a performance impact.

@Aaronontheweb
Member

Thanks for the really detailed write-up - we'll look into it

@Arkatufus
Contributor

Arkatufus commented Apr 27, 2020

The bug happens when multiple children of a pool node with Directive.Escalate fail in rapid succession.

On a failure, the child suspends itself before bubbling a Failed message up to the pool node. The parent pool node, on receiving the Failed message, in turn sets its IsFailed flag and bubbles the same message to its parent. The problem starts when another child fails before the pool node recovers. Because the pool node's IsFailed flag is set, it refuses to process any subsequent Failure event, so the second child's Failure event never gets past the failed pool node. Since the child always suspends itself before bubbling up a Failure event, it is never resumed. This leaves an unbalanced suspend on the child that can never be resolved, except by restarting the pool node.

The child was suspended here: https://github.com/akkadotnet/akka.net/blob/dev/src/core/Akka/Actor/ActorCell.FaultHandling.cs#L231
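
The sequence described above can be sketched as a toy model. This is not Akka.Net code: `Child`, `PoolNode`, `recover` and `simulate` are hypothetical names, and `recover` is only a stand-in for whatever decision comes back from the pool node's parent. The sketch only illustrates how a failure that is dropped while the parent is already failed leaves a suspend without a matching resume.

```python
class Child:
    """Toy routee that counts suspends and resumes."""

    def __init__(self, name):
        self.name = name
        self.suspend_count = 0

    def fail(self, parent):
        # The child suspends itself first, then reports the failure upwards.
        self.suspend_count += 1
        parent.handle_failure(self)

    def resume(self):
        self.suspend_count -= 1

    @property
    def frozen(self):
        return self.suspend_count > 0


class PoolNode:
    """Toy parent that ignores failures while it is itself failed."""

    def __init__(self):
        self.is_failed = False
        self.accepted = []  # children whose failure was actually processed

    def handle_failure(self, child):
        if self.is_failed:
            return  # dropped: this child will never get a matching resume
        self.is_failed = True
        self.accepted.append(child)

    def recover(self):
        # Stand-in for the grandparent's decision arriving (assumption).
        for child in self.accepted:
            child.resume()
        self.accepted.clear()
        self.is_failed = False


def simulate(count, rapid):
    """Fail `count` children; `rapid` means no recovery between failures."""
    pool = PoolNode()
    children = [Child(f"routee-{i}") for i in range(count)]
    for child in children:
        child.fail(pool)
        if not rapid:
            pool.recover()
    pool.recover()
    return [child.name for child in children if child.frozen]


print(simulate(3, rapid=True))   # every child after the first stays frozen
print(simulate(3, rapid=False))  # spaced-out failures leave nobody frozen
```

In this model, spacing the failures out gives the parent time to recover between them, which matches the report that the issue does not reproduce when there is a delay between exceptions.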

@Aaronontheweb
Member

@Arkatufus so it sounds like the issue is that the state for the supervisor, when using OneForOne strategies, doesn't account for multiple children failing once the parent has already escalated once. Can you reproduce this problem for actors where the parent isn't a pool router? Or is it a router-specific problem?

@Aaronontheweb Aaronontheweb modified the milestones: 1.4.5, 1.4.6 Apr 29, 2020
@Arkatufus
Contributor

This is, somehow, a pool-router-only problem.

@Nk185
Author

Nk185 commented Apr 30, 2020

@Arkatufus may I ask you to check things again? I was able to reproduce the issue for a non-router actor (the CustomRouter<T> in my Gist) which is a regular ReceiveActor with many failing children. Btw, this was also outlined in the 4th point of the "Main observations based on Gist code" paragraph.

@Aaronontheweb
Member

@Nk185 have you followed @Arkatufus's work on this issue in #4393? You might want to review the code and changes there.

@Arkatufus
Contributor

@Nk185 I might need to write a more comprehensive test for it; so far I've only managed to make the pool router fail.

@Arkatufus
Contributor

As a progress update, I've found a code discrepancy between Akka.NET and Scala Akka on this line:
https://github.com/akkadotnet/akka.net/blob/dev/src/core/Akka/Actor/ActorCell.DefaultMessages.cs#L260

The check should live inside the switch statement, not outside of it. Fixing that fixes the spurious exception handling by the Actor supervisor, but something else is still blocking the Mailbox from getting resumed.
I'm still looking into the cause.

Arkatufus added a commit to Arkatufus/akka.net that referenced this issue May 7, 2020
@Arkatufus
Contributor

This seems to fix the bug; it needs further testing.

@Aaronontheweb
Member

@Arkatufus nice work - I'll review it

Aaronontheweb added a commit that referenced this issue May 8, 2020
* Add bug reproduction spec

* Debug test program

* Fix #4376, Actor suspended indefinetly after failing

* Fix broken Visual Studio solution file

* Unroll recursion in SysMsgInvokeAll

Co-authored-by: Aaron Stannard <[email protected]>