Clustering audit log #6077
Comments
That's a great idea @SebastianStehle! This would have a significant impact on debugging and performance tuning. |
@SebastianStehle To add to the thought about building a product for others: a SIEM with log message identifiers (one per message type, e.g. https://www.vulpoint.be/siem-and-windows-event-logs-id/) and a trace identifier would help as well. Some events likely need to be handled through separately agreed processes, which probably involve outsourced people who are not familiar with the system beyond doing certain things when something happens.
@SebastianStehle, do you mean for this log to include a subset of events that |
I don't know yet. It should be something high level, but whatever is useful. Let's say you install an application like a clustered XMPP server. Then you are not interested in all the details and just want to see what is going on. I would like to have a page in the Dashboard that shows a network diagram of my cluster members, with a slider in the footer where I can replay the audit log to understand what has changed. About my problem yesterday morning: I actually found out that MongoDB also reported network errors at the same time.
To chime in quickly, it may be a good idea to have something like This is not Orleans specific: The case I have come across is that the system emits events to the Windows Event Log, for instance, and another tool then collects and forwards them to some other system. Then, as an example, there are instructions (or a processing pipeline leading to instructions) stating that events of a certain severity immediately trigger some action in support functions. Depending on the event type, it can be routed to technical staff, customer support, or some other place, and often these functions are detached from the original system unless it is a technical condition that needs to be remedied, in which case perhaps the software team is contacted.

It could be useful to also have a way to collect and pick some events into the tool itself so it can react, or so the developers have a dashboard they can react to when they are looking into a problem. Sometimes, perhaps often, the developers do not have access to the Event Log, so they log some things to a database for troubleshooting, since limited access to that is more often possible. Sometimes, if the customer has actually wanted it, as may be the case in larger installations involving more critical functions, data is moved to ELK-style systems (https://www.elastic.co/what-is/elk-stack) for further processing and correlation.

It would be great if there were a well-defined and systematic way to create tooling on top of Orleans, and if it could ride on top of .NET Core facilities for the most part (I don't know how much there is to do this). In addition, this same pipeline could collect grain call information and the like for processing and display in the tooling: a kind of firehose of events with trace identifiers etc. What I've written is slightly different from what @SebastianStehle wrote to start this, though maybe it is useful if I pile on a bit.
Some of this is also discussed at https://twitter.com/davidfowl/status/1189031080076009472 in the rather long thread about actor systems and why they may feel difficult to troubleshoot in production (there are other reasons too). I also cross-reference aspnet/Logging#612 for a collection of links on enterprise-type event correlation (e.g. aspnet/Logging#612 (comment)) and #4992, which is the most recent thread on one technical aspect of this case. Then there is the sample https://github.com/dotnet/orleans/tree/master/Samples/OneBoxDeployment, which considers many aspects of this in the context of cyber-physical systems. Maybe it is time to sketch a bit of a roadmap and have a Gitter discussion about this? :)
For me the most important point is: it should be built in. In k8s I can just check the dashboard or run kubectl describe to find out what is going on, and I can dig deeper with the logs and tools like ELK. But it is not easy to set up good monitoring and alerting, especially for end users.
Yes, this should be fine. Perhaps we can combine it with something like snapshots, so that each member logs its current view of the cluster to an interface. We could then build a unified view of the cluster at any point in time.
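The snapshot idea above could be sketched roughly as follows. This is only an illustration in Python, not Orleans code: `Snapshot`, `unified_view`, and the status strings are hypothetical names, and it assumes each member reports a timestamped map of how it currently sees every other member.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Snapshot:
    """One member's view of the cluster at a point in time (hypothetical shape)."""
    taken_at: float          # timestamp, e.g. seconds since some epoch
    reporter: str            # member that logged this snapshot
    view: Dict[str, str]     # member name -> status as seen by the reporter


def unified_view(snapshots: List[Snapshot], at: float) -> Dict[str, Dict[str, str]]:
    """Combine the latest snapshot of every reporter taken at or before `at`.

    Returns: member name -> {reporter name -> status that reporter saw}.
    """
    latest: Dict[str, Snapshot] = {}
    # Walk snapshots in time order so later ones overwrite earlier ones.
    for snap in sorted(snapshots, key=lambda s: s.taken_at):
        if snap.taken_at <= at:
            latest[snap.reporter] = snap

    merged: Dict[str, Dict[str, str]] = {}
    for reporter, snap in latest.items():
        for member, status in snap.view.items():
            merged.setdefault(member, {})[reporter] = status
    return merged
```

Moving `at` back and forth over the stored snapshots would give the kind of replay slider mentioned earlier in the thread; disagreement between reporters about one member shows up as differing values in that member's entry.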
We are marking this issue as stale due to the lack of activity in the past six months. If there is no further activity within two weeks, this issue will be closed. You can always create a new issue based on the guidelines provided in our pinned announcement. |
This issue has been marked stale for the past 30 days and is being closed due to lack of activity.
Hi,
I would like to open a discussion:
For me the clustering is sometimes hard to understand, and it would be very good to have some kind of audit log, like in Kubernetes, with high-level messages that tell me the status of the cluster, such as:
This could be very valuable because reading the logs can be hard, and in case of cluster errors you can have thousands of consecutive errors that you need to filter out.
Just this morning my k8s restarted my nodes because the health check became unhealthy, and I do not really understand the reason.
It would be great to have an IAuditLog interface that just logs to the logger by default, but can also be used to persist events to a database, where they can then be consumed by OrleansDashboard or similar.
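As a rough illustration of that shape (a Python sketch, not the Orleans API; `AuditEvent`, `AuditLog`, and the three sink classes are hypothetical names made up for this example), the interface could be a single `record` method with interchangeable sinks:

```python
import logging
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Protocol


@dataclass(frozen=True)
class AuditEvent:
    """One high-level cluster event (hypothetical shape, not an Orleans type)."""
    event_id: int    # stable identifier per message type
    member: str      # cluster member the event is about
    message: str
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class AuditLog(Protocol):
    """Stand-in for the proposed IAuditLog interface."""
    def record(self, event: AuditEvent) -> None: ...


class LoggerAuditLog:
    """Default sink: forwards every event to the normal logger."""
    def __init__(self, logger: logging.Logger) -> None:
        self._logger = logger

    def record(self, event: AuditEvent) -> None:
        self._logger.info("[%d] %s: %s", event.event_id, event.member, event.message)


class InMemoryAuditLog:
    """Stand-in for a database-backed sink that a dashboard could query."""
    def __init__(self) -> None:
        self.events: List[AuditEvent] = []

    def record(self, event: AuditEvent) -> None:
        self.events.append(event)


class CompositeAuditLog:
    """Fans one event out to several sinks."""
    def __init__(self, *sinks: AuditLog) -> None:
        self._sinks = sinks

    def record(self, event: AuditEvent) -> None:
        for sink in self._sinks:
            sink.record(event)
```

Usage would be something like `CompositeAuditLog(LoggerAuditLog(logging.getLogger("cluster.audit")), InMemoryAuditLog())`, so the default behaviour stays plain logging while a persistent sink can be added without touching the code that emits events.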
If you are writing a product for other users, like Squidex (https://github.com/squidex/squidex), it is hard to find good default values for the log levels. It is either too much or too little.
Alternatively, we need to be more careful with logging categories, because class names are not a good source for categories, especially when potentially dozens of classes are involved in a job like clustering.
To be honest, I have no idea what something like this means: