[processor/transform] - Log processing capabilities #9410
Comments
This is an awesome write-up. My main takeaway is that there is quite a bit of overlap between log-collection and transformprocessor, with log-collection being the more mature, feature-rich component. If the goal is to consolidate processors into the transform processor, then log-collection should be merged in, with the gaps in the transform processor's functionality filled so that the log-collection features work as expected. This will make the transform processor more capable for other signals as well. But that is a very significant undertaking. Of all the processors that have been mentioned as candidates for merging into the transform processor, log-collection feels like the most complex. I'm not sure it makes sense to make log-collection the first merger.
I am starting to take a look at this, starting with the json_parser functionality.
@djaglowski The Is there any plan to add |
@bencehornak-gls in the transform processor you want to use the ExtractPatterns converter.
Thanks for the tip, @TylerHelmuth, that helped a lot! I overlooked this converter.
This is a very rough draft analysis of how the transformprocessor could be enhanced to support the log processing capabilities of the log-collection library. Certainly more careful design would be warranted, but the suggestions and examples are a starting point for conversation.
Path expressions vs "field syntax"
log-collection defines a field syntax that is very similar to transformprocessor's "path expression". However, it allows for the ability to refer to nested fields in attributes and body. This would be a very important capability for parity.
Expressions
log-collection exposes an expression engine. This could be represented as a new function called expr(), which would typically be composed into other functions:
set(attributes["some.attr"], expr(foo ? 'foo' : 'no foo'))
Alternately, it may be possible to provide equivalent functions for the same capabilities.
Parsers
log-collection's generic parsers like json, regex, csv, and keyvalue could be represented as functions. These all produce a map[string]interface{}, which could then be set as desired:
parse_json(body, attributes)
parse_regex(body, '/^.....$/', attributes["tmp"])
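To make the "parser produces a map, which is then set as desired" idea concrete, here is a minimal Python sketch. It is not the transformprocessor or log-collection implementation; the record layout (a dict with "body" and "attributes") and the function signatures are assumptions chosen for illustration.

```python
import json
import re


def parse_json(source: str) -> dict:
    """Parse a JSON object from text; raise ValueError if it is not an object."""
    value = json.loads(source)  # JSONDecodeError is a subclass of ValueError
    if not isinstance(value, dict):
        raise ValueError("expected a JSON object")
    return value


def parse_regex(source: str, pattern: str) -> dict:
    """Return the named capture groups of the first match, or raise ValueError."""
    match = re.search(pattern, source)
    if match is None:
        raise ValueError("pattern did not match")
    return match.groupdict()


# Hypothetical log record: a body string plus an attributes map.
record = {"body": '{"user": "ada", "status": 200}', "attributes": {}}

# Equivalent in spirit to parse_json(body, attributes): merge the parsed map.
record["attributes"].update(parse_json(record["body"]))

# Equivalent in spirit to parse_regex(body, pattern, attributes["tmp"]).
line = "2022-04-21 ERROR disk full"
fields = parse_regex(line, r"^(?P<date>\S+) (?P<level>\S+) (?P<msg>.*)$")
```

Both parsers return a plain map and leave the decision of where to store it to the caller, which is the property the proposed functions rely on.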
A common pattern is to "embed" subparsers into these generic parsers, with the primary benefit being that they only execute if the primary parsing operation succeeded. Possibly this could be represented with some kind of conditional sub-query concept:
parse_json(body, attributes)
strptime(attributes["time"], "%a %b %e %H:%M:%S %Z %Y", "Local")
Moving values around
set is equivalent to add
retain is roughly equivalent to keep_keys, but there appear to be some nuanced differences. Need to look into this more.
copy(from, to)
move(from, to)
remove(attributes["one"], attributes["two"])
flatten(attributes["multilayer.map"])
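The operations listed above could behave roughly as in the following Python sketch. The semantics here are assumptions made for illustration only: in particular, flatten is read as "promote the children of a nested map one level up", and retain as "drop every attribute not named", either of which may differ in detail from what log-collection actually does.

```python
def move(attrs: dict, src: str, dst: str) -> None:
    """Move the value at src to dst; do nothing if src is absent."""
    if src in attrs:
        attrs[dst] = attrs.pop(src)


def retain(attrs: dict, *keys: str) -> None:
    """Keep only the named keys, removing everything else in place."""
    keep = set(keys)
    for key in list(attrs):
        if key not in keep:
            del attrs[key]


def flatten(attrs: dict, key: str) -> None:
    """Replace a nested map at key with its children (assumed semantics)."""
    nested = attrs.get(key)
    if isinstance(nested, dict):
        del attrs[key]
        attrs.update(nested)


# Hypothetical attributes map used only to exercise the helpers.
attrs = {"one": 1, "two": 2, "multilayer.map": {"a": "x", "b": "y"}}
flatten(attrs, "multilayer.map")
move(attrs, "one", "first")
retain(attrs, "first", "a")
```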
Timestamps
log-collection supports multiple timestamp parsing syntaxes, namely strptime, gotime, and epoch (unix). These would translate fairly easily to functions:
strptime(attributes["time"], "%a %b %e %H:%M:%S %Z %Y", "Local")
gotime(attributes["time"], "Jan 2 15:04:05 MST 2006")
unixtime(concat(attributes["seconds"], ".", attributes["nanoseconds"]), "s.ns")
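As a rough illustration of the epoch case, the sketch below converts "seconds.fraction" text into integer nanoseconds without going through floating point. The function name unixtime_ns and the choice to treat the part after the dot as a decimal fraction of a second are assumptions, not the behavior of either component.

```python
def unixtime_ns(text: str) -> int:
    """Convert 'seconds.fraction' text into integer nanoseconds since the epoch."""
    seconds, _, fraction = text.partition(".")
    # Right-pad the fraction to nine digits so "5" means 500000000 ns,
    # and truncate anything finer than one nanosecond.
    nanos = (fraction + "0" * 9)[:9]
    return int(seconds) * 1_000_000_000 + int(nanos)


timestamp = unixtime_ns("1136214245.123456789")
```

Working on the string avoids the precision loss that a float conversion would introduce at nanosecond resolution.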
Severity
log-collection provides a very robust mechanism for interpreting severities, which may be difficult to represent in the syntax of this processor. The main idea of the system is that severity is interpreted according to a mapping. Several out-of-the-box mappings are available, and the user can layer on additional mappings as needed. This gives a concise configuration, and the implementation can be highly optimized (a single map lookup, instead of iteration over many queries).
One way to represent these same capabilities would require a class of functions that produce and/or mutate severity mappings:
sevmap_default()
sevmap_with(sevmap_empty(), as_warn("w", "warn", "warning", "hey!"), as_error("e", "error", "err", "noo!"))
sevmap_with(sevmap_http(), as_fatal(404))
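The layering idea can be sketched in Python as plain dictionaries that are merged once, so that interpreting a severity is a single lookup. All names below mirror the hypothetical functions above; as_level stands in for as_warn, as_error, and as_fatal, and the lower-casing of keys and the "unspecified" default are assumptions.

```python
def sevmap_empty() -> dict:
    """Start from a mapping with no entries."""
    return {}


def as_level(level: str, *aliases) -> dict:
    """Build one layer that maps each alias (case-insensitively) to a level."""
    return {str(alias).lower(): level for alias in aliases}


def sevmap_with(base: dict, *layers: dict) -> dict:
    """Return a new mapping with each layer applied on top of base, in order."""
    merged = dict(base)
    for layer in layers:
        merged.update(layer)
    return merged


def severity(value, sevmap: dict, default: str = "unspecified") -> str:
    """Interpret a raw value with a single dictionary lookup."""
    return sevmap.get(str(value).lower(), default)


custom = sevmap_with(
    sevmap_empty(),
    as_level("warn", "w", "warn", "warning", "hey!"),
    as_level("error", "e", "error", "err", "noo!"),
)
with_http = sevmap_with(custom, as_level("fatal", 404))
```

Because the layers are merged when the mapping is built, the per-log cost stays at one lookup regardless of how many aliases are configured.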
Conditionality
if(condition, run(q1, q2, q3))
if(parse_json(body, attributes), run(strptime(...), severity(...)))
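One way to read if(condition, run(...)) is that the condition reports success or failure and the follow-up steps run only on success. The Python sketch below illustrates that reading; run_if, parse_json_into, and uppercase_level are hypothetical names, and the record layout is assumed.

```python
import json


def parse_json_into(record: dict) -> bool:
    """Merge the JSON object in record['body'] into attributes; report success."""
    try:
        value = json.loads(record["body"])
    except (TypeError, ValueError):
        return False
    if not isinstance(value, dict):
        return False
    record["attributes"].update(value)
    return True


def run_if(condition, record: dict, *steps) -> bool:
    """Run the steps in order only when the condition succeeds."""
    if not condition(record):
        return False
    for step in steps:
        step(record)
    return True


def uppercase_level(record: dict) -> None:
    """Hypothetical follow-up step that depends on the parsed fields."""
    attrs = record["attributes"]
    if "level" in attrs:
        attrs["level"] = str(attrs["level"]).upper()


good = {"body": '{"level": "warn"}', "attributes": {}}
bad = {"body": "not json", "attributes": {}}
good_ran = run_if(parse_json_into, good, uppercase_level)
bad_ran = run_if(parse_json_into, bad, uppercase_level)
```

This captures the main benefit of embedded subparsers: the dependent steps never see a record whose primary parse failed.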
Routing
log-collection supports a router capability, which allows users to apply alternate processing paths based on any criteria that can be evaluated against individual logs. A brute force equivalent would be to apply the same where clause repeatedly:
strptime(attributes["time"], "%a %b %e %H:%M:%S %Z %Y") where body ~= some_regex
severity(attributes["sev"], stdsevmap()) where body ~= some_regex
gotime(attributes["timestamp"], "Jan 2 15:04:05 MST 2006") where body ~= other_regex
severity(attributes["status.code"], httpsevmap()) where body ~= other_regex
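For contrast with the repeated where clauses, a router evaluates each criterion once per log and then runs every step on the chosen path. The following Python sketch shows that shape; the route function, the first-match policy, and the example patterns and steps are all assumptions made for illustration.

```python
import re


def route(record: dict, routes, default=()) -> None:
    """Run the steps of the first route whose pattern matches the body."""
    for pattern, steps in routes:
        if pattern.search(record["body"]):
            for step in steps:
                step(record)
            return
    for step in default:
        step(record)


def tag(key: str, value: str):
    """Build a hypothetical step that sets one attribute."""
    def step(record: dict) -> None:
        record["attributes"][key] = value
    return step


routes = [
    (re.compile(r"^\d{3} "), [tag("format", "http"), tag("sevmap", "http")]),
    (re.compile(r"^[A-Z]+ "), [tag("format", "app"), tag("sevmap", "std")]),
]

http_log = {"body": "404 /missing", "attributes": {}}
app_log = {"body": "ERROR disk full", "attributes": {}}
other_log = {"body": "???", "attributes": {}}
for log in (http_log, app_log, other_log):
    route(log, routes, default=[tag("format", "unknown")])
```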
Resource and scope challenges
Logs often contain information about the resource and/or scope which must be parsed from text. Isolating and setting these values is fairly straightforward when working with a flat data model, such as is used in log-collection, but it's not clear to me whether the pdata model will struggle with this.
For example, suppose a log format is shaped like resource_name,scope_name,message. Should/does transformprocessor create a new pdata.ResourceLogs each time a resource attribute is isolated? Should it cross reference with existing resources in the pdata.ResourceLogsSlice and combine them? Could it do this performantly? How many log processing functions could trigger this kind of complication? (e.g. move(attributes["resource_name"], resource["name"])).
Need to give more thought to this area especially.
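To illustrate the "cross reference and combine" option, the Python sketch below regroups flat lines of the form resource_name,scope_name,message so that records sharing a resource and scope end up together. It uses nested dictionaries as a stand-in for the pdata hierarchy and says nothing about how pdata itself would do this; group_by_resource is a hypothetical name.

```python
def group_by_resource(lines) -> dict:
    """Group 'resource_name,scope_name,message' lines by resource, then scope."""
    grouped: dict = {}
    for line in lines:
        # Split at most twice so commas inside the message are preserved.
        resource, scope, message = line.split(",", 2)
        grouped.setdefault(resource, {}).setdefault(scope, []).append(message)
    return grouped


lines = [
    "svc-a,http,started",
    "svc-b,db,connected",
    "svc-a,http,ready, listening",
]
grouped = group_by_resource(lines)
```

Each record costs a constant number of dictionary lookups here, but a real implementation would also need to compare full resource attribute sets rather than a single name, which is where the performance question above comes in.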