
Clustering audit log #6077

Closed · SebastianStehle opened this issue Oct 30, 2019 · 10 comments
Labels: stale (Issues with no activity for the past 6 months)

@SebastianStehle (Contributor) commented Oct 30, 2019

Hi,

I would like to open a discussion:

For me, clustering is sometimes hard to understand, and it would be very helpful to have some kind of audit log, like in Kubernetes, with high-level messages that tell me the status of the cluster, such as:

  • Member (IP) has joined the cluster.
  • Member (IP) is not reachable from (IP), (IP) and ...

This could be very valuable, because reading the logs can be hard, and when the cluster has problems you can get thousands of consecutive errors that you need to filter out.

Just this morning my Kubernetes cluster restarted my nodes because the health check reported them as unhealthy, and I do not really understand why.

It would be great to have an IAuditLog interface whose default implementation just writes to the logger, but which could also persist the events to a database to be consumed by OrleansDashboard or similar. Roughly something like the sketch below.
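
A very rough sketch of what I mean (purely hypothetical; IAuditLog, ClusterAuditEvent, and LoggerAuditLog are made-up names here, not existing Orleans types):

```csharp
// Purely hypothetical sketch; none of these types exist in Orleans today.
using System;
using Microsoft.Extensions.Logging;

// A single high-level cluster event, e.g. "member joined" or "member unreachable".
public sealed class ClusterAuditEvent
{
    public DateTimeOffset Timestamp { get; set; } = DateTimeOffset.UtcNow;
    public string Kind { get; set; }      // e.g. "MemberJoined", "MemberUnreachable"
    public string Member { get; set; }    // silo address, e.g. "10.0.0.22:11111"
    public string Details { get; set; }   // human-readable summary
}

// The extension point: the default implementation just logs,
// another implementation could persist events to a database.
public interface IAuditLog
{
    void Record(ClusterAuditEvent auditEvent);
}

// Default implementation that simply forwards events to the standard logger.
public sealed class LoggerAuditLog : IAuditLog
{
    private readonly ILogger<LoggerAuditLog> logger;

    public LoggerAuditLog(ILogger<LoggerAuditLog> logger) => this.logger = logger;

    public void Record(ClusterAuditEvent auditEvent) =>
        logger.LogInformation("{Timestamp:o} [{Kind}] {Member}: {Details}",
            auditEvent.Timestamp, auditEvent.Kind, auditEvent.Member, auditEvent.Details);
}
```

A database-backed implementation could then replace LoggerAuditLog, and a dashboard would read from that store.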

If you are writing a product for other users, like Squidex (https://github.com/squidex/squidex), it is hard to choose good defaults for the logs. Either there is too much or too little.

Or we need to be more careful with logging categories, because class names are not a good source for categories, especially when potentially dozens of classes are involved in a job like clustering.

To be honest, I have no idea what something like this means:

 Silo S10.0.0.22:11111:309992883 is rejecting message: Request S10.0.0.22:11111:309992883*stg/13/0000000d@S0000000d->S10.0.2.132:11111:309992903*stg/10/0000000a@S0000000a #1069522: . Reason = Recent (00:00:00.0755821 ago, at 2019-10-30 06:20:30.534 GMT) connection failure trying to reach target silo S10.0.2.132:11111:309992903. Going to drop Request msg 1069522 without sending. CONNECTION_RETRY_DELAY = 00:00:01.
@Zeroshi commented Oct 30, 2019

That's a great idea @SebastianStehle! This would have a significant impact on debugging and performance tuning.

@veikkoeeva (Contributor)

@SebastianStehle To add to the thought about building a product for others: SIEM integration calls for log message identifiers (one per message type, e.g. https://www.vulpoint.be/siem-and-windows-event-logs-id/) and trace identifiers as well. Some events likely need to be handled through separately agreed processes, which may involve outsourced people who are not familiar with the system beyond doing certain agreed steps when something happens.

@sergeybykov (Contributor)

@SebastianStehle, do you mean for this log to include a subset of events that MembershipOracle logs today? Or is it something else?

@sergeybykov added this to the Triage milestone Oct 31, 2019
@sergeybykov self-assigned this Oct 31, 2019
@SebastianStehle (Contributor, Author) commented Oct 31, 2019

I don't know yet. It should be something high level, but whatever is useful.

Let's say you install an application like a clustered XMPP server. You are not interested in all the details; you just want to see what is going on.

I would like to have a page in the Dashboard that shows a network diagram with my cluster members and a slider in the footer where I can replay the audit log to understand what has changed.

About my problem yesterday morning: I actually found out that at the same time MongoDB also reported network errors.

@veikkoeeva (Contributor) commented Oct 31, 2019

To chime in quickly, it may be a good idea to have something like an IAuditLog interface that sees the local messages pass through and can perhaps collect high-level events from that information. I have not researched whether this would be possible with the current .NET Core logging and eventing pipeline without an explicit interface, and even if so, whether there could be an Orleans-facing piece on top of it.
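
As a rough illustration of riding on the existing pipeline (a sketch only; it assumes the logging abstractions roughly as they were around .NET Core 3.x, and the category filter and the onAuditEvent callback are made up here, not existing Orleans APIs):

```csharp
// Sketch: tap the standard ILogger pipeline and forward clustering-related entries
// to a separate audit sink. The "Membership" category filter and the onAuditEvent
// callback are assumptions, not existing Orleans APIs.
using System;
using Microsoft.Extensions.Logging;
using Microsoft.Extensions.Logging.Abstractions;

public sealed class MembershipAuditLoggerProvider : ILoggerProvider
{
    private readonly Action<string> onAuditEvent;

    public MembershipAuditLoggerProvider(Action<string> onAuditEvent) => this.onAuditEvent = onAuditEvent;

    public ILogger CreateLogger(string categoryName) =>
        // Only wrap categories that look membership/clustering related (assumed naming).
        categoryName.Contains("Membership", StringComparison.OrdinalIgnoreCase)
            ? (ILogger)new AuditLogger(onAuditEvent)
            : NullLogger.Instance;

    public void Dispose() { }

    private sealed class AuditLogger : ILogger
    {
        private readonly Action<string> onAuditEvent;
        public AuditLogger(Action<string> onAuditEvent) => this.onAuditEvent = onAuditEvent;

        public IDisposable BeginScope<TState>(TState state) => NullLogger.Instance.BeginScope(state);
        public bool IsEnabled(LogLevel logLevel) => logLevel >= LogLevel.Information;

        public void Log<TState>(LogLevel logLevel, EventId eventId, TState state,
            Exception exception, Func<TState, Exception, string> formatter)
        {
            if (IsEnabled(logLevel))
                onAuditEvent(formatter(state, exception));
        }
    }
}
```

Registering a provider like this via ILoggingBuilder.AddProvider would let a tool collect such entries without touching Orleans internals, though real event identifiers would make the filtering much more robust than matching on category names.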

This is not Orleans-specific: the case I have come across is that the system emits events to the Windows Event Log, for instance, and another tool then collects and forwards them to some other system. Then, for example, there are instructions (or a processing pipeline leading to instructions) so that events with a certain severity immediately cause some sort of action in support functions. Depending on the event type, it can be routed to technical staff, customer support, or some other place, and often these functions are detached from the original system, unless it is a technical condition that needs to be remedied, in which case the software team is perhaps contacted.

It could be useful to have a way to also collect and surface some events to the tool itself so it can react, or so the developers have a dashboard they can react to if they are looking into it. Sometimes, perhaps often, the developers do not have access to the Event Log, so they may log some things to a database for troubleshooting, since limited access to that is more often possible. Sometimes, if the customer has actually wanted it, as may be the case in larger installations involving more critical functions, data will be moved to https://www.elastic.co/what-is/elk-stack kind of systems for further processing and correlation. It would be great if there were a well-defined and systematic way to create tooling on top of Orleans, and if it could ride on top of .NET Core facilities for the most part (I don't know how much is already there to do this).

In addition to all this, the same pipeline could collect grain call information and the like for processing and display in the tooling; a kind of firehose of events with trace identifiers and so on. What I have written here is slightly different from what @SebastianStehle wrote to start this, but maybe useful if I pile on a bit. Some of this is also discussed at https://twitter.com/davidfowl/status/1189031080076009472 in a fairly long thread on actor systems and why they can feel difficult to troubleshoot in production (there are other reasons too).

I also cross-reference aspnet/Logging#612 for a collection of links on enterprise-type event correlation (e.g. aspnet/Logging#612 (comment)) and #4992, which is the most recent thread on one technical aspect of this case. Then there is the sample https://github.com/dotnet/orleans/tree/master/Samples/OneBoxDeployment that considers many aspects of this in the context of cyber-physical systems.

Maybe it is time to sketch a bit of a roadmap and have a Gitter discussion about this? :)

@SebastianStehle (Contributor, Author)

For me the most important point is: it should be built in. In Kubernetes I can just check the dashboard or run kubectl describe to find out what is going on, and I can dig deeper with the logs and with tools like ELK. But it is not easy to set up good monitoring and alerting, especially for end users.

@sergeybykov (Contributor)

What I'm trying to understand is if a stream of cluster membership events would be sufficient, for example, like what @olegbsky is printing in his test app.

Or do you have something very different in mind?

@SebastianStehle (Contributor, Author)

Yes, this should be fine. Perhaps we can combine it with something like snapshots, so that each member logs its current view of the cluster to an interface.

We could then build a unified view of the cluster at any point in time.
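
As a rough sketch of what I mean (I am assuming Orleans' ISiloStatusOracle / ISiloStatusListener work roughly like this; the exact signatures may differ, and the listener class itself is made up):

```csharp
// Sketch only: one audit entry per membership change plus this member's current
// view of the cluster. Assumes ISiloStatusOracle/ISiloStatusListener behave as
// described; MembershipAuditListener is a hypothetical class.
using System;
using System.Linq;
using Microsoft.Extensions.Logging;
using Orleans.Runtime;

public sealed class MembershipAuditListener : ISiloStatusListener
{
    private readonly ISiloStatusOracle oracle;
    private readonly ILogger<MembershipAuditListener> logger;

    public MembershipAuditListener(ISiloStatusOracle oracle, ILogger<MembershipAuditListener> logger)
    {
        this.oracle = oracle;
        this.logger = logger;
        oracle.SubscribeToSiloStatusEvents(this);
    }

    public void SiloStatusChangeNotification(SiloAddress updatedSilo, SiloStatus status)
    {
        // The high-level event itself, e.g. "member ... is now Dead".
        logger.LogInformation("{Time:o} member {Silo} is now {Status}",
            DateTimeOffset.UtcNow, updatedSilo, status);

        // The "snapshot": this member's current view of the whole cluster.
        var view = string.Join(", ",
            oracle.GetApproximateSiloStatuses()
                  .Select(kvp => $"{kvp.Key}={kvp.Value}"));
        logger.LogInformation("{Time:o} cluster view from this silo: {View}",
            DateTimeOffset.UtcNow, view);
    }
}
```

If each member wrote these snapshots to a shared store instead of the logger, the Dashboard could replay them on a timeline.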

@sergeybykov modified the milestones: Triage, Backlog Sep 26, 2020
@sergeybykov removed their assignment Sep 26, 2020
@ghost added the stale (Issues with no activity for the past 6 months) label Dec 11, 2021
@ghost commented Dec 11, 2021

We are marking this issue as stale due to the lack of activity in the past six months. If there is no further activity within two weeks, this issue will be closed. You can always create a new issue based on the guidelines provided in our pinned announcement.

@ghost commented Mar 4, 2022

This issue has been marked stale for the past 30 days and is being closed due to lack of activity.

@ghost closed this as completed Mar 4, 2022
@ghost locked as resolved and limited conversation to collaborators Apr 4, 2022