Audit logging structured entries #31046
Pinging @elastic/es-security
When we have discussed structured logging in the past, we had agreed to go down the path of structured logging instead of JSON because it is human readable yet machine parseable, and is more succinct. I think it'd be a shame to have to use JSON for the audit logs. I understand the issues with escaping, and I don't think we want several regexes to parse different log files. Instead I'm thinking of something like:
So we can extract the message plus each individual field using the same process for every log line. @rjernst You used structured logging in a previous job - you must have faced similar problems. Any suggestions for this?
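As a hypothetical illustration of the kind of templated format discussed above (the field names and the exact layout here are invented for the sketch): if each entry carries explicit `name=[value]` pairs, a single regex can extract every field from every log line, regardless of the entry type.

```python
import re

# Hypothetical templated audit line: a positional preamble followed by
# explicit name=[value] fields (layout invented for illustration).
line = ('[2018-06-12T09:01:23,456] [transport] [access_granted] '
        'origin_type=[local_node] principal=[_system] action=[cluster:monitor/state]')

# One pattern extracts every name=[value] pair, whatever the entry type.
FIELD = re.compile(r'(\w+)=\[([^\]]*)\]')
fields = dict(FIELD.findall(line))

print(fields['principal'])   # _system
print(fields['action'])      # cluster:monitor/state
```

The catch, discussed further down in this thread, is that values containing `]` would still need an escaping convention.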
@clintongormley in the context of my pretentious lingo about petty matters, JSON is a structured logging example. Basically, the difference between templated and structured logging is whether values are accompanied by the field name.
The JSON will not be nested, so a one line entry (unformatted with spaces) should still be readable by the typical IT admin eye. I am leaning towards the flat JSON with the text "message" field because it is the canonical way (parsers don't require Regexp), although it loses on the conciseness field. WDYT?
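A sketch of what such a flat, single-line JSON entry might look like (the field names here are illustrative, not the final schema):

```python
import json

# Illustrative flat (non-nested) audit entry with a free-text "message" field.
entry = {
    "timestamp": "2018-06-12T09:01:23,456",
    "event_type": "access_granted",
    "principal": "_system",
    "action": "cluster:monitor/state",
    "message": "access granted for user [_system]",
}

# Serialized without extra spaces: one line per event, no regex needed to parse.
line = json.dumps(entry, separators=(",", ":"))
parsed = json.loads(line)
print(parsed["principal"])  # _system
```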
My personal experience here is that at my last job I used winston with the default JSON formatting, feeding into Papertrail. I found the lack of conciseness to be extremely annoying, and as a business we never derived any particular value from the JSON structure. That is obviously anecdotal. I set it up that way because it seemed like a good idea at the time, but I would think twice before doing it again. Also, since it's not trivial for things like Postgres to log in JSON format, anyone using an alternate or additional log aggregation service is likely going to be dealing with templated logs regardless. And in the case of audit logging, there will very often be a 3rd party aggregation service involved somehow (certainly this was the case when I worked at a large investment bank). I'm a bit curious why we can't support both formats via config?
It's still not fantastically human readable, and we do want to optimise for readability so that sysadmins don't hate us at 3am. The most important thing for me is that we use the same style of logging across all logs generated by Elasticsearch. Think about what a stack trace would look like in JSON. Awful! /cc @elastic/es-core-infra for opinions.
Of course this is a possibility, but it'd be much nicer to be able to satisfy both needs out of the box, so that centralised logging doesn't require config changes that then make local access unreadable.
I'm torn on which is more readable, json or text. In my previous use of structured logging, it was for search request timings (among other things). It was a multi line format with an end of record marker and became hard to read because it was packed together so tightly with no extra spacing. The format would roughly look something like:
So there was a general set of "timers" (which could change in code based on the request/parts of the code executed), and the times were always in milliseconds. I believe there were some other generalized keys with subkeys, but the timers are, I think, the most illustrative for my point: having types of the values based on structure (e.g. all subkeys of "timers") makes the server side a lot more flexible. Obviously this could be in JSON as well, it would just be more verbose. In some cases (large entries), the JSON is probably easier to read (if pretty printing is used). I would just make sure in this case that the superfluous formatting (i.e. spaces/newlines) is removed before transferring the records.
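A toy sketch of that kind of multi-line record with an end-of-record marker (keys, values and the marker are all invented for illustration): the parser can type every subkey of "timers" as a millisecond integer purely by structure, without a per-field schema.

```python
# Hypothetical multi-line templated record, terminated by a marker line.
record = """\
request: /search?q=foo
timers.parse: 3
timers.fetch: 41
timers.merge: 7
%%END%%"""

# Subkeys of "timers" are known by convention to be millisecond integers,
# so the parser types them by structure rather than by field name.
fields, timers = {}, {}
for line in record.splitlines():
    if line == "%%END%%":
        break
    key, _, value = line.partition(": ")
    if key.startswith("timers."):
        timers[key[len("timers."):]] = int(value)  # ms by convention
    else:
        fields[key] = value

print(timers)  # {'parse': 3, 'fetch': 41, 'merge': 7}
```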
I should also mention my use was a dedicated log for search requests, while I think part of our dilemma is we have multiple logs for different purposes that we are trying to have a unified format for. IMO logs with general messages should look different than request logs, etc.
I am concerned by the fact that this discussion is being driven on a targeted issue (structured logging for audit logging) rather than more broadly within Elasticsearch. Is there a reason that we are approaching this from this angle?
I have tried to scope my work to achieve the removal of audit indexing #29881. To that end, only the logfile audit has to be machine readable, so that it's easy to parse (without utilizing multiple regexps). The topic did touch matters that are not audit-logfile specific, such as stacktraces. IMO log formats can differ: request, slowlog and audit can have different formats, although I agree it would be desirable that all logs are templated or all logs are JSON-like. In that light, given that in the current implementation we are mostly on the templated bandwagon with all the logs (and no JSON-like), I am now swayed towards the templated format that @clintongormley proposed, which has names for fields, so that a single regexp is required to parse all entries and there is no possibility for multiple patterns to overlap. It is also readable (entries are phrases instead of tabular), resembling the general log.
I think the only reason is that the audit log is the first one to get an overhaul, so it serves as a test case.
@albertzaharovits I'm not saying we MUST go down this route, but I would appreciate at least some investigation to see if it is feasible. We still have the escaping issue to sort out, but that may be as simple as escaping. If this ends up proving too difficult or unusable, then I'm OK going back to the JSON route.
Given the "Filebeat module for Elasticsearch" effort elastic/beats#5301, and specific to the audit logfile in elastic/beats#7365, I think @ruflin and @radoondas might help weigh in on trade-offs from an ingest perspective.
From an ingest perspective I would hope that any new format does not require grok patterns, and that we can use dissect instead: https://www.elastic.co/guide/en/logstash/current/plugins-filters-dissect.html I'm linking to the Logstash docs as they are more extensive, but the dissect implementations on the Beats and Logstash side are identical. JSON is nice from a machine perspective but also has its downsides, which we see now in Kibana, as we don't know in advance what fields will show up in the log line: any plugin can add its own fields. For the unknown parts of a log line, having it all in one field and letting Elasticsearch tokenize it is very powerful.
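As a rough sketch, a dissect filter for a hypothetical templated audit line (the tokenizer string and field names are invented for illustration, not a tested config) might look like this in Logstash:

```
filter {
  dissect {
    # Hypothetical line shape: "[timestamp] [origin] event_type, <rest>"
    mapping => {
      "message" => "[%{timestamp}] [%{origin}] %{event_type}, %{rest}"
    }
  }
}
```

Unlike grok, dissect does no regex matching: it just splits the line on the literal delimiters between the `%{}` tokens, which is why it is cheaper and less ambiguous for fixed-shape lines.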
I need to work on a POC, but I think we can meet all the demands, i.e. be unambiguous and parsable by both man and machine. Working on the POC now, will come back with details!
Filebeat will have to support whatever the default is. So even if we have a JSON option which is off by default, it will mean Filebeat has to implement and maintain the grok patterns. One thing we started experimenting with is mixing grok and dissect: doing all the basic work with dissect and only having a simple grok pattern for the parts which can't be covered by dissect. Looking forward to the POC.
The structured audit log events PR #31931 is plodding forward and I can now run most tests. That PR constructs audit event entries as log4j messages. After some internal mulling over the clearest and most extensible log event format, I wish to gather some external input on this matter. In a few words, I think JSON structured entries for the audit log are the most concise and readable format, even considering that the log is read exclusively by humans. In addition, the real value comes from easy machine ingestion. This is in contrast to the other proposed format of 'templated' entries, i.e. entries with values interspersed with text. Foreword: I think it's best if I start with some examples of the current entry format followed by the proposed entry format. Also, it's worth keeping in mind that the audit log trail is a separate file from all the other logs. Current audit events (with randomization):
Proposed audit events:
Some things to note about the proposed alternative:
My argumentation rests on the opinion that templated entries (that have text interspersed between the values) do not add value and, on the contrary, might hurt the eye of the human administrator. The templated phrasing is not really an improvement; maybe for a first-time user it reads more gently, and I might agree that we could add another field to the proposed JSON format containing a proper phrase. But again, I don't see the value; maybe I am too deep into this stuff and it's just my narrow vision on the subject. I want to conclude by underscoring that this trail should be in a separate file, with different rotation and ingestion preferences. It does not contain stacktraces or suggestions that humans (users or administrators) should follow.
There is another argument for using JSON here. I think there is also an opportunity on the UI side to make JSON logs more human-readable, to get the best of both worlds.
Just had a chat with @albertzaharovits and I agree that the templated approach is a poor cousin to just using JSON in the case of the audit log. Ideally, I'd like to keep some human readability by having e.g. the timestamp and severity as the first two positional arguments, followed by the JSON string. @ruflin how feasible would it be to parse a format like this in Beats?
This should definitely be doable. Haven't tested it, but I think a processor would look as follows on the Beats side:
Note: Are the [] brackets for the last part needed? Was just thinking that things could become more complicated if there is a ] inside the JSON somewhere.
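For illustration, a Filebeat processor chain for entries of the form `[timestamp][severity] {json...}` might be sketched like this (the tokenizer string and target names are assumptions, not a tested config; it deliberately leaves the JSON payload unbracketed to sidestep the `]`-inside-JSON problem mentioned above):

```yaml
processors:
  - dissect:
      # Split the positional prefix from the JSON payload.
      tokenizer: "[%{timestamp}][%{severity}] %{json_payload}"
      field: "message"
  - decode_json_fields:
      # Parse the remaining payload into structured fields.
      fields: ["dissect.json_payload"]
      target: "audit"
```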
If we take the JSON path, I think it would be helpful (if possible) to make sure the keys are ordered in a logical way.
Here is a generous assortment of log lines, as they are generated in the #31931 PR:
These are the complete formatted entries; there is no other preamble or suffix.
Besides the low-level questions, such as field names, order and data types, which are tracked in the #31931 PR, there is an important question still to be discussed:
I think we should be reluctant to make this sort of change. I would lean towards:
I tend to agree with that. But I don't like to have another one.
Enabling both is probably suboptimal because of wasted resources; ultimately nobody would use both. I think we should enable the new appender only, with the previous one commented out. This way we are more explicit that something changed (the file is not there, as opposed to the content format having changed), and if this comes as a surprise (because of not reading the release change notes) users can switch on the previous behavior at runtime by changing the log4j file. If we default to enabling the existing format and commenting out the new one, we are implicitly delaying its adoption (until we decide we don't want our stack to parse the existing format, since Beats has to support the default), as I don't think users will like to tinker with log4j files if everything "works". In a way, I propose to force the new format onto our users, and be very obvious about it, while at the same time making it easy for them to keep using the deprecated format.
I suggest we ship with both outputs enabled for 6.x and remove the old output for 7. I think there is even a possibility that we can programmatically get the appenders for that logger, see if the old format is enabled, and log the deprecation message in 6.x. Existing consumers will work and the Beats consumer will work as well, with the downside of more disk usage and I/O.
To facilitate machine readability, log entries should be structured in a dictionary-like format. Structured logging is also known as semantic or typed logging. Examples of such structured formats are key-value pairs, JSON and XML. In general, flat text entries are more concise but trade away unambiguity for human readability. Machine-readable entries are more verbose and explicit, shunning ambiguity.
The status quo for audit logging is a mixture of the templated and the structured entry types. In general, the templated entry type is a good tradeoff between user readability and the dictionary-like format. It is not ambiguous and is also more concise compared to the structured type. Here is an example entry from the current implementation:
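An illustrative entry in this mixed templated style (values invented; the exact current format may differ in its preamble and field set):

```
[2018-06-12T09:01:23,456] [transport] [access_granted]  origin_type=[local_node], principal=[_system], action=[internal:transport/proxy], request=[ClusterStateRequest]
```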
Unfortunately, in the current implementation, entries are ambiguous because the set of allowed characters in user and role names includes '=', ' ', '[' and ']'. Fixing this would allow for unambiguous templated log entries.
Unambiguous means machine readable, i.e. you can write a regex to parse a specific field, so why bother with the verbose structured format? The answer is that the next step after parsing is indexing, and indexing requires that event fields have a name and a type. The parser of a templated entry type infers the field name and type from the token's position. But there may be several entry types with different positional parsing required, so one has to create several patterns. What if the patterns overlap (several can match the same line), or, more commonly, what if parsers assign the same field name to fields with different data types? This is very possible because the code which generates entries and the code which parses and assigns type names live in different projects. For that matter, it is also relatively easy for the generating and the parsing pieces to go out of sync with each other, requiring integration tests...
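A toy sketch of the type-clash pitfall just described (both entry shapes and patterns are invented): two parsers for two templated entry types extract a field under the same name but with different types, exactly the kind of conflict an index mapping cannot reconcile.

```python
import re

# Two hypothetical templated entry types, each needing its own pattern.
granted = "access_granted user=[jdoe] roles=[admin,ops]"
denied  = "access_denied user=[jdoe] code=[403]"

GRANTED = re.compile(r"access_granted user=\[(?P<user>[^\]]*)\] roles=\[(?P<detail>[^\]]*)\]")
DENIED  = re.compile(r"access_denied user=\[(?P<user>[^\]]*)\] code=\[(?P<detail>[^\]]*)\]")

# Both parsers emit a field named "detail", but one value is a list of
# strings and the other an integer -- a clash only caught at index time.
d1 = GRANTED.match(granted).groupdict()
d2 = DENIED.match(denied).groupdict()
detail1 = d1["detail"].split(",")   # ['admin', 'ops']
detail2 = int(d2["detail"])         # 403
print(type(detail1).__name__, type(detail2).__name__)  # list int
```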
Lecture over, here is the proposal:
The format of the entries should be JSON objects, UTF-8 encoded. This format is dictionary-like and unambiguous: each field has a name, and special characters in values are escaped. The new format will coexist alongside the present one, in a separate file trail.
Field names should align with the Elastic Common Schema (ECS). ECS encourages dot notation for field names, i.e. related fields are similarly prefixed. This proposal will follow that suggestion, but note that entries will not be nested; values are not dictionaries.
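A sketch of a flat entry with ECS-style dot-notation field names (the specific names are illustrative, not the final schema): the keys are dotted strings, but the object itself stays un-nested.

```python
import json

# Flat object: related fields share a "user." / "event." prefix,
# but no value is itself a dictionary.
entry = {
    "event.type": "access_granted",
    "event.action": "cluster:monitor/state",
    "user.name": "jdoe",
    "user.roles": "admin,ops",
}

# The flatness invariant: dotted names, no nesting.
assert not any(isinstance(v, dict) for v in entry.values())
line = json.dumps(entry, separators=(",", ":"))
print(line)
```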
The stretch goal is to have the format configurable, so that even the SYSLOG message format from RFC 5424 might be supported. One possible avenue for this is the message abstraction in log4j 2.
This proposal is a narrowing of #8786 to the scope of the audit log.
Relates to #29881 .