This repository has been archived by the owner on Nov 10, 2022. It is now read-only.

add comparison with Orleans #8

Merged: rkuhn merged 4 commits into master from wip-Orleans on May 28, 2015

rkuhn
Contributor

@rkuhn rkuhn commented Apr 30, 2015

No description provided.

@rkuhn rkuhn mentioned this pull request Apr 30, 2015
@gabikliot

There are a lot of inaccuracies in the description of Orleans (not sure why people choose to assume the worst when they lack information; it is probably in our human nature), but the main one is about messaging guarantees. By default, Orleans guarantees at-most-once delivery, not at-least-once:
http://dotnet.github.io/orleans/Runtime-Implementation-Details/Messaging-Delivery-Guarantees.html
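
To make the distinction concrete, here is a minimal sketch of the two delivery modes. It is written in Python as a language-neutral illustration and is not Orleans or Akka code: `ScriptedChannel`, `send_at_most_once`, and `send_at_least_once` are hypothetical names, and the channel simply replays a scripted list of outcomes so that the behaviour is deterministic.

```python
from typing import Iterable, List

# Possible fates of a single transmission (hypothetical, for illustration).
LOST, ACK_LOST, OK = "lost", "ack_lost", "ok"


class ScriptedChannel:
    """Toy channel whose behaviour is scripted in advance."""

    def __init__(self, outcomes: Iterable[str]):
        self._outcomes = iter(outcomes)
        self.delivered: List[str] = []  # what the receiver actually saw

    def transmit(self, msg: str) -> bool:
        """Send once; return True only if the sender sees an acknowledgement."""
        outcome = next(self._outcomes, OK)
        if outcome == LOST:
            return False                # message never arrived
        self.delivered.append(msg)      # message arrived at the receiver
        return outcome == OK            # ACK_LOST: arrived, but ack was lost


def send_at_most_once(channel: ScriptedChannel, msg: str) -> None:
    """Send exactly one transmission and never retry."""
    channel.transmit(msg)


def send_at_least_once(channel: ScriptedChannel, msg: str, max_attempts: int = 5) -> bool:
    """Retry until acknowledged; the receiver may see duplicates."""
    for _ in range(max_attempts):
        if channel.transmit(msg):
            return True
    return False
```

With at-most-once the message may be lost but is never duplicated; with at-least-once it arrives if any attempt succeeds, but the receiver has to tolerate duplicates, for example by deduplicating on a message ID.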

@gabikliot

Another inaccuracy:
For example: "Grains can only work in either the fully blocked or fully reentrant modes, limiting the user’s choices to a safe one and a fast one".
Of course this is completely not true. One can choose to queue a continuation (await) on a different scheduler (the thread pool, for example). This works because we fully integrate with the .NET TPL library, which supports it natively: Task.Factory.StartNew and Task.ContinueWith provide overloads that take a different TaskScheduler, and Task.Run schedules work onto the thread pool.

@gabikliot

If you are truly interested I can write a detailed bullet by bullet response to the inaccuracies in this summary.

@rkuhn
Contributor Author

rkuhn commented Apr 30, 2015 via email

@rkuhn
Contributor Author

rkuhn commented Apr 30, 2015 via email

@gabikliot

There was one mistake in the paper, about Messaging Delivery Guarantees. All the rest is correct.
The other inaccuracies I think stem from making assumptions about things that were not described or described partly. Naturally, one cannot describe all details in a short paper.
The paper, as its name says, is about the Virtual Actor abstraction and its benefits. It's not a full, 100% detailed explanation of everything we did in project Orleans over 6 years.

Many more details are on our documentation website: http://dotnet.github.io/orleans/.
I will provide more detailed comments.

EDIT: Also, the paper was written quite a while ago; the system of course kept evolving, and we have added capabilities and fixed, changed, or improved certain things since then. The paper is not to be blamed for that, right?

EDIT: One thing that I can fully agree with: Akka has much nicer and more detailed online documentation than Orleans. Orleans is still young on GitHub (we only got to GitHub in January), so we have not yet fully caught up on documentation. We will.

@rkuhn
Contributor Author

rkuhn commented Apr 30, 2015 via email


* Akka as a toolkit for building distributed systems, offering the full power but also exposing the inherent essential complexity.

* Orleans restricts applicability in order to allow seamless use without understanding distributed computing.


Orleans allows people who are not distributed computing experts to write distributed services, but that does not mean an expert cannot find it useful, or that it is necessarily limited in applicability or performance. An expert can configure and extend it for a wide range of applications. I would phrase it: "Orleans simplifies distributed computing, allowing non-experts to write distributed services. At the same time there are enough extension points and flexibility in Orleans to allow even an expert to customize his services in a flexible way". Arguably, Akka has more extensibility than Orleans at this point (I don't know enough Akka to say for sure, but if someone who knows both systems says so, I would easily believe her). One also does not want to provide too many extensibility points, otherwise the system becomes too clunky and too complicated to understand and use.
But it would not be true to say that Orleans does not provide a lot of extensibility as well, or that it is only good for "dumb non-experts".

Contributor Author


The intent of the point-by-point comparison is to highlight the differences between both approaches, not to sell one or the other. Therefore since both Orleans and Akka provide extensibility, the interesting question in this particular section is what each solution targets, and reading the TR I came to the conclusion that—bluntly speaking—Akka requires developers to understand distributed computing while Orleans aims at avoiding the necessity for that. This is supported by quotes such as

To build a correct solution to such problems in the application, the developer must be a distributed systems expert. To avoid these complexities, we built the Orleans programming model and runtime, which raises the level of the actor abstraction.

and

This level of indirection provides the runtime with the opportunity to solve many hard distributed systems problems that must otherwise be addressed by the developer.

Since I assume that you are one of the co-authors of the paper, you are in a good position to answer the question: what is the primary focus of Orleans? It is completely fine if Orleans also does other things as secondary concerns, but I get the feeling that the primary focus of Orleans and Akka is fundamentally different, so that is what I would like to highlight here.

I’ll flesh out the intro of “different focus” to clarify this intent of this section.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I am one of the authors. Undoubtedly, the primary focus of Orleans is to simplify distributed computing and allow non-experts to write efficient, scalable and reliable distributed services. The principle "Guide the developers down a path of best practices" is exactly about that.
As a secondary focus we also provide a flexible platform for more expert developers.

@gabikliot

@rkuhn, I am replying inline.
At this stage I also wanted to separate clarifying Orleans issues from asking questions about Akka. We do have a lot of questions about Akka, but I will hold all of them and first would try to get clarity about Orleans, and then as a second step get deeper into Akka.
Would that be OK?


#### Virtual Actor Space

* In Orleans each type of Grain corresponds to a practically infinite space of Grain instances that conceptually all exist from the beginning of the universe to its end. The relation to the physical Actors that implement the Grains is explained similar to virtual and physical memory, but this comparison is misleading since the virtual address space of a process is explicitly populated with the desired contents instead of containing the whole system’s information by default.


The analogy was meant in the sense of on-demand paging and the separation of virtual from physical addresses. In that context we do think this is a very suitable, and even a great, analogy. Virtual actors have multiple aspects, and not all of them are similar to virtual memory (the eternal existence is not), but on-demand instantiation and automatic reclamation (similar to virtual memory swap-out) are.

Contributor Author


Okay, will clarify that the analogy applies to that aspect.
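
The aspect of the analogy discussed here, on-demand instantiation plus automatic reclamation, can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the Orleans runtime: `VirtualActorDirectory`, `Activation`, the idle timeout, and the dictionary standing in for persistent storage are all hypothetical, and whether state survives deactivation is a choice made by this sketch only.

```python
import time
from typing import Callable, Dict, List


class Activation:
    """In-memory incarnation of a virtual actor (hypothetical)."""

    def __init__(self, key: str, count: int = 0):
        self.key = key
        self.count = count      # toy state: number of messages handled
        self.last_used = 0.0

    def handle(self, msg: str) -> int:
        self.count += 1
        return self.count


class VirtualActorDirectory:
    """Every key is always addressable; activations exist only on demand."""

    def __init__(self, idle_timeout: float,
                 clock: Callable[[], float] = time.monotonic):
        self._idle_timeout = idle_timeout
        self._clock = clock
        self._active: Dict[str, Activation] = {}
        self._store: Dict[str, int] = {}   # stand-in for persistent storage

    def send(self, key: str, msg: str) -> int:
        act = self._active.get(key)
        if act is None:
            # "Page in": create the activation on first use, restoring state.
            act = Activation(key, self._store.get(key, 0))
            self._active[key] = act
        act.last_used = self._clock()
        return act.handle(msg)

    def collect_idle(self) -> List[str]:
        """'Swap out': deactivate actors that have been idle for too long."""
        now = self._clock()
        idle = [k for k, a in self._active.items()
                if now - a.last_used >= self._idle_timeout]
        for key in idle:
            self._store[key] = self._active.pop(key).count
        return idle

    def active_keys(self) -> List[str]:
        return sorted(self._active)
```

The caller never creates or destroys an actor explicitly; it only sends to a key, which mirrors how a process touches a virtual address without managing the physical page behind it.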

@gabikliot

Thank you @rkuhn. Overall looks much more accurate. There are still a couple of places where you did not incorporate my feedback.

  1. the optimized local msgs - I explained how both our approaches are similar and that "Akka’s lack of indirection DOES NOT allow substantial performance optimizations in this regard", compared to Orleans. We do the same.
  2. I think you did not stress that Akka breaks grain isolation, by "the default executors do not respect this constraint and will typically run Future continuations concurrently with the Actor that scheduled them."
  3. "Grains can only work in either the fully blocked or fully reentrant modes": I explained how this is not correct. A grain can have a mixed mode as well (via TPL primitives). And grains can also support "becoming".

@rkuhn
Contributor Author

rkuhn commented May 4, 2015

Yes, I have not yet incorporated everything, I ran into the wall of my time box (this week is rather busy, I’ll come back to this).

@rkuhn
Contributor Author

rkuhn commented May 28, 2015

Sorry that it took a little longer. I have added a commit that should address the outstanding comments, please review. If everything is good then I’d like to merge.

@gabikliot

Looks good. Thanks.

@gabikliot

Ronald, I have a question:
@sergeybykov wanted to comment on one of your previous comments:

"The important refinement that is missing here is that neither horizontal nor vertical is the solution, instead we must acknowledge that both are orthogonal concerns and contribute to the solution. For the definitions I’m using please refer to the glossary of the Reactive Manifesto: Errors are made by clients and need to be signaled back to them while Failures render the service unable to perform its function and need vertical help for recovery. We call that supervision, it might be called differently, but the crucial notion is that clients shall not receive Failures. The other notable fact is that stack traces are meaningless in distributed systems, which makes throwing exceptions even less appropriate for Error handling than it was in classical (local) OO programming: the service should reply with a normal value instead that denominates the error.
In any case, can we conclude that I fix the description of shared-state concurrency problems to clarify that low-level data races are removed while high-level message races remain? And I shall also create a new paragraph describing the difference in failure/error handling philosophy."

but we seem not to be able to find that original comment.

The same goes for my comment about "Distributed Systems Bibles" and the Reactive Manifesto not being one of them (for me personally). I actually think my comment is important and expresses something that people need to hear.

What happened to those comments?

@rkuhn
Contributor Author

rkuhn commented May 28, 2015

Comments are shown as “outdated diff” in the history above, you can expand them and still read them, but since I corrected the text line that they referred to they don’t apply to the current version any longer (in github’s opinion).

One way to avoid this “comment folding” is to make overall comments (as we are doing right now) or comment on individual commits instead of on the PR—this works better if the history is kept, which I intend to do here.

In any case, thanks for lifting these important ones up to this level, they are easily accessible now.

Concerning failures/errors I just added another commit that should clarify the wording and more precisely capture the difference in philosophy between Orleans and Akka. If the current text is okay from your side then I’d like to merge it—we can always discuss and make changes via new issues or PRs on this repository.

@gabikliot

Ohh, I see now! Thanks! That was tricky.

Yes, go ahead please. I say let's merge it all, and you can just "publish" this whole doc; then we can potentially submit new pull requests to further extend some points. I think it is in a wip branch. Maybe it can make it to the main branch. The current version is definitely good enough as version 1.

@rkuhn
Contributor Author

rkuhn commented May 28, 2015

Thanks for your help, @gabikliot !

rkuhn added a commit that referenced this pull request May 28, 2015
add comparison with Orleans
@rkuhn rkuhn merged commit f193b15 into master May 28, 2015
@rkuhn rkuhn deleted the wip-Orleans branch May 28, 2015 18:05
@sergeybykov

The other notable fact is that stack traces are meaningless in distributed systems, which makes throwing exceptions even less appropriate for Error handling than it was in classical (local) OO programming: the service should reply with a normal value instead that denominates the error.

@rkuhn It's interesting that you mentioned it. In Orleans, because of the async RPC as the primary mode of interacting with actors and the automatic propagation of errors, stack traces are actually meaningful. Let me illustrate this with an example.

Say you have a web frontend (FE) that upon receiving a REST request makes a call to method X of grain A. As part of executing A.X(), grain A makes a call to method Y of grain B. B.Y() in its turn makes a call to storage (ST) that throws an exception, e.g. storage is temporarily unavailable. So you have a call chain of FE -> A.X() -> B.Y() -> ST.

It is enough to put a try/catch at the FE level to handle such errors AND to have a meaningful call stack of the error. So if you write something like

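try
{
    // A.X() calls B.Y(), which calls storage (ST).
    var x = await a.X(args);
}
catch (Exception exc)
{
    // Any exception thrown further down the chain surfaces here.
    log(exc);
}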

then the exception exc will contain a chain of inner exceptions and their call stacks: the original exception thrown by ST and the exceptions re-thrown by B and A, with their respective call stacks:

ExceptionA (A.X()/A.XImpl()/A.CallB())
ExceptionB (B.Y()/B.YInternal()/B.SaveToStorage())
ExceptionST (ST.Foo()/ST.Bar())

This is the default behavior, with no error handling logic in A or B. Of course, one can put a try/catch within A.X() and/or B.Y() to analyze the error and retry, report it, or alter execution of the method, e.g. write to a queue in case the primary storage is unavailable.

We believe this is a very important feature of Orleans: distributed and asynchronous propagation of exceptions. It makes reasoning about errors in a distributed app almost comparable to reasoning about them in a single-process app, and it allows developers to write the minimum amount of code for the most common cases, as in the example above, where error handling logic lives only at the FE level. With pure one-way message passing such simplicity is pretty much impossible to achieve.
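
As a rough, single-process analog of the FE -> A.X() -> B.Y() -> ST chain above, the sketch below uses Python's asyncio rather than Orleans. In Orleans the hops may cross machine boundaries, with the runtime carrying the exception back; here everything runs in one process, so the sketch only illustrates the programming model of catching once at the top while still seeing the whole call chain. All function names are hypothetical stand-ins.

```python
import asyncio
import traceback
from typing import List, Tuple


class StorageUnavailable(Exception):
    """Stand-in for the exception thrown by storage (ST)."""


async def storage_write(value: int) -> None:
    raise StorageUnavailable("storage is temporarily unavailable")


async def grain_b_y(value: int) -> None:
    # No error handling here: the exception simply propagates to the caller.
    await storage_write(value)


async def grain_a_x(value: int) -> None:
    # No error handling here either.
    await grain_b_y(value)


async def frontend(value: int) -> Tuple[str, List[str]]:
    """Single try/except at the top of the call chain."""
    try:
        await grain_a_x(value)
        return "ok", []
    except Exception as exc:
        # The traceback names every async frame the exception passed through.
        frames = [frame.name for frame in traceback.extract_tb(exc.__traceback__)]
        return "error", frames


if __name__ == "__main__":
    print(asyncio.run(frontend(42)))
```

The list of frames returned by `frontend` plays the role of the chained call stacks in the listing above: outermost caller first, the failing storage call last.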

@rkuhn
Contributor Author

rkuhn commented Jun 2, 2015

@sergeybykov Yes, you’re right: if the service’s model fits request–response well then that gives you a convenient handle on error propagation, as noted in the comparison document. My point is that assuming an exception to be an error (as opposed to a failure) unless stated otherwise can misguide system design, and I certainly do not agree that all error handling logic should be moved up to the FE level. What should be done at that level is user input validation, and if valid inputs lead to errors further down then someone made a mistake internally, which means service failure and not user error.

@sergeybykov

@rkuhn I didn't mean that error handling should be done at the FE level. It can definitely be done at any level in the call chain that makes sense for the app. What we see in practice in interactive services, though, is that useful error handling can only rarely be done at lower levels.

For example, a retry can be attempted by any actor in the call chain. However, in most cases the lower levels cannot know whether a retry is desirable in the scenario; only the top layers in the call chain, FE or not, may know that.

I'm not sure I agree with your black and white picture of errors vs. failures. In a distributed system sometimes you simply don't know for sure. A socket error may indicate a failure of the node on the other end or just a temporary network glitch. From the application perspective they are indistinguishable until we learn that this is in fact a failure of the remote node. In the meantime, the app usually cannot wait for such a fact to be established, and has to treat this ambiguous situation as an error.

@rkuhn
Contributor Author

rkuhn commented Jun 3, 2015

We should probably meet sometime and discuss this over a beverage of your choice ;-)

One more thought here is that the black/white separation’s feasibility depends on the definitions: it gets pretty clear and simple if you apply HTTP thinking, as in 4xx status codes mean “you did something wrong” (i.e. an Error) while 5xx status codes mean “I did something wrong” (i.e. a Failure). Then it does not matter whether I know that another remote node is down or not—if I cannot provide my service then it is a Failure on my part. This keeps all observations and their reactions nicely local, which is a big plus in distributed systems.
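
The 4xx/5xx framing can be sketched like this. It is a Python illustration with hypothetical names (`Accepted`, `Rejected`, `ServiceFailure`, `withdraw`, `to_http_status`), not code from Akka or Orleans: an Error is returned to the client as an ordinary reply value, while a Failure is raised and handled by whoever is responsible for the service.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Accepted:
    """Normal reply: the request was carried out."""
    new_balance: int


@dataclass
class Rejected:
    """Error: the client did something wrong; reported as a normal value."""
    reason: str


class ServiceFailure(Exception):
    """Failure: the service itself cannot perform its function."""


Reply = Union[Accepted, Rejected]


def withdraw(balance: int, amount: int, storage_ok: bool = True) -> Reply:
    if amount <= 0 or amount > balance:
        # Invalid input is the client's Error (4xx in HTTP terms).
        return Rejected("invalid amount")
    if not storage_ok:
        # Valid input that cannot be served is our Failure (5xx).
        raise ServiceFailure("cannot persist the new balance")
    return Accepted(balance - amount)


def to_http_status(call: Callable[[], Reply]) -> int:
    """Map the outcome of a call onto the HTTP status classes from the text."""
    try:
        reply = call()
    except ServiceFailure:
        return 500
    return 400 if isinstance(reply, Rejected) else 200
```

Note that the classification is made locally: `withdraw` does not need to know why storage is unavailable in order to report that it failed to provide its service.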

@sergeybykov

Definitely. So many questions can be sorted out over a drink. :-) Let me know when you travel to Seattle next time.
